fix(#4372): snapshot leaderServer to prevent null connectToServer in httpCommand retry#4378
Conversation
…httpCommand retry
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 0 |
🟢 Coverage 75.00% diff coverage · -7.46% coverage variation
Metric Results Coverage variation ✅ -7.46% coverage variation Diff coverage ✅ 75.00% diff coverage Coverage variation details
Coverable lines Covered lines Coverage Common ancestor commit (3ad0c09) 127798 93872 73.45% Head commit (f162bc9) 159477 (+31679) 105247 (+11375) 66.00% (-7.46%) Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch:
<coverage of head commit> - <coverage of common ancestor commit>Diff coverage details
Coverable lines Covered lines Diff coverage Pull request (#4378) 4 3 75.00% Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified:
<covered lines added or modified>/<coverable lines added or modified> * 100%
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
This pull request addresses issue #4372 by resolving a null-read race condition on leaderServer in the retry path of RemoteHttpComponent.httpCommand. It introduces volatile fields and snapshots leaderServer to prevent premature loop exits when the leader is concurrently set to null. The feedback highlights critical concurrency issues that remain: updating currentServer and currentPort independently is not atomic and can lead to mismatched host/port states, and other fields like currentReplicaServerIndex and replicaServerList are still not thread-safe, which could cause runtime exceptions under concurrent access.
| protected volatile String currentServer; | ||
| protected volatile int currentPort; |
There was a problem hiding this comment.
While making currentServer and currentPort volatile ensures that updates to each individual field are visible to other threads, it does not guarantee atomicity across updates to both fields. Because they are updated independently (e.g., in reloadClusterConfiguration()), a concurrent reader thread calling getUrl() or getDatabasePath() could read a new currentServer with an old currentPort (or vice versa), leading to a mismatched host/port connection.\n\nTo prevent this split-state race condition, consider encapsulating the current server address in a single volatile reference, such as a Pair<String, Integer> or a dedicated HostAddress object, and updating/reading that reference atomically.
| protected String currentServer; | ||
| protected int currentPort; | ||
| private volatile Pair<String, Integer> leaderServer; | ||
| private int currentReplicaServerIndex = -1; |
There was a problem hiding this comment.
Since this PR introduces volatile fields to support multi-threaded usage of RemoteHttpComponent, please note that currentReplicaServerIndex and replicaServerList are still not thread-safe.\n\nSpecifically:\n1. currentReplicaServerIndex is incremented and checked non-atomically in getNextReplicaAddress(). Under concurrent access, this can lead to race conditions where the index exceeds the bounds of replicaServerList, throwing an IndexOutOfBoundsException.\n2. replicaServerList is a standard ArrayList which is modified (cleared and populated) in requestClusterConfiguration() and read in getNextReplicaAddress(). Concurrent modification and traversal of an ArrayList without synchronization will lead to ConcurrentModificationException or undefined behavior.\n\nTo fully support concurrent usage, consider using thread-safe alternatives such as AtomicInteger for the index and a thread-safe list (e.g., CopyOnWriteArrayList or proper synchronization) for the replica list.
Code ReviewThe core fix is correct and clearly described. Snapshotting A few things worth addressing: 1.
|
Code Review: fix(#4372) - snapshot leaderServer to prevent null connectToServer in httpCommand retryOverviewThis PR fixes a real concurrency bug in RemoteHttpComponent.httpCommand: during the retry path, leaderServer was read twice without snapshotting, allowing a concurrent call to reloadClusterConfiguration() (which temporarily sets leaderServer = null) to race in between the two reads. The result was connectToServer = null, which caused the for-loop to silently exit without trying any available replica. The fix is correct and minimal. A few observations below. What's Good
Issues and Suggestions1. Two docs files should not be committed (medium) The PR adds docs/4372-remotehttp-leader-null-snapshot.md and docs/review-deferred-518a5ce3.md. These are working notes - the first duplicates the PR description and commit message; the second is an internal review-cycle artifact referencing a specific git SHA that will be meaningless to future readers. Both should be dropped from the commit. The PR body is the right place for this analysis. 2. getLeaderAddress() can still NPE (minor) With leaderServer now volatile, a concurrent null write is immediately visible to callers of getLeaderAddress(), making an NPE slightly more likely to surface than before. This is pre-existing, but consider adding a null check or documenting that callers must ensure leaderServer is not null. 3. volatile alone does not cover the check-then-act in reloadClusterConfiguration() (informational) The PR description correctly notes the class is "not thread safe". Just capturing it here: volatile gives visibility, not atomicity. The fix is correct within the documented single-caller assumption, but making only leaderServer volatile while leaving the rest of the state unguarded creates a partial-visibility situation worth tracking in a follow-up issue (as the deferred doc itself notes). 4. Test server thread has no socket timeout (minor) If the client never completes the HTTP header (e.g., test is aborted), the server thread hangs indefinitely. A client.setSoTimeout(5000) after accept() would make cleanup more reliable in CI. 5. Test uses reflection to access replicaServerList (minor) TestableRemoteHttpComponent already uses subclassing to override methods. A protected void addReplica(Pair<String, Integer> replica) test seam on the base class would be less brittle and more consistent with the existing pattern. 6. @SuppressWarnings("unchecked") scoped to entire test method (nit) The unchecked cast is isolated to a single line. Narrowing the annotation scope would be cleaner. SummaryThe core fix is correct and well-targeted. The main action before merge is removing the two docs-directory markdown files (item 1). Everything else is minor. The regression test covers the scenario effectively. |
Code Review - PR #4378Fix: snapshot OverviewThis PR correctly identifies and fixes a real TOCTOU race in What's Good
Issues and Suggestions1. Documentation files do not belong in
2. Test uses reflection to access final Field f = RemoteHttpComponent.class.getDeclaredField("replicaServerList");
f.setAccessible(true);
((List<Pair<String, Integer>>) f.get(this)).add(new Pair<>("127.0.0.1", replicaPort));This is fragile - a field rename breaks the test silently at runtime rather than compile time. A package-private or protected setter/accessor on 3. Test contains em dashes in comments (low - style) Per CLAUDE.md: "Never use the em dash character ( // Primary server: closed port — triggers ConnectException (IOException) on iteration 0
// With the fix: httpCommand snapshots leaderServer, detects null, falls throughReplace the 4. Raw HTTP mock in test is fragile (low) The test stands up a raw TCP socket that manually reads HTTP headers and returns 5. Missing The test binds real OS sockets. It's fast, but worth confirming it runs cleanly in the module's CI suite on all platforms (the Correctness VerificationThe fix is logically correct: final Pair<String, Integer> snapshotLeader = leaderServer; // volatile read, single snapshot
if (leaderIsPreferable && snapshotLeader != null && !currentConnectToServer.equals(snapshotLeader)) {
connectToServer = snapshotLeader;
} else
connectToServer = getNextReplicaAddress();
SummaryThe core fix is correct and well-scoped. The two blocking items before merge are:
The deferred thread-safety items ( |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4378 +/- ##
============================================
- Coverage 64.37% 64.32% -0.06%
+ Complexity 447 428 -19
============================================
Files 1647 1647
Lines 127798 127799 +1
Branches 27395 27395
============================================
- Hits 82276 82209 -67
- Misses 33957 34052 +95
+ Partials 11565 11538 -27 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Closes #4372
RemoteHttpComponent.httpCommandreadleaderServertwice in its retry path. The first read (pre-loop, line ~237) was already null-safe. The second read (retry catch block, line ~349) was not: afterreloadClusterConfiguration()returnstrue, a concurrent call to the same method can setleaderServer = nullbefore the second read executes. The result wasconnectToServer = null, which made the for-loop exit prematurely and abandon the request without attempting any available replica.The fix snapshots
leaderServerinto a local variable once afterreloadClusterConfiguration()completes, and adds an explicit null guard before assigning toconnectToServer. When the snapshot is null, the code falls through togetNextReplicaAddress()so any available replica is tried instead of silently giving up.leaderServer,currentServer, andcurrentPortare also markedvolatileso writes from one thread are immediately visible to concurrent readers.Test plan
RemoteHttpComponentTest.httpCommandRetryWithReplicaWhenLeaderNulledConcurrently: overridesreloadClusterConfiguration()to returntruewithleaderServer = nulland a working replica in the list. ConfiguresNETWORK_SAME_SERVER_ERROR_RETRIES = 2somaxRetry = 2. Connects to a closed port (primary) - triggers IOException - then verifieshttpCommandretries with the replica and returns successfully instead of throwing.🤖 Generated with Claude Code