Skip to content

fix(#4273): implement STICKY strategy to pin HTTP transactions to a concrete cluster member#4277

Merged
robfrank merged 6 commits into
mainfrom
fix/4273-connection-strategy-sticky-pins-server
May 21, 2026
Merged

fix(#4273): implement STICKY strategy to pin HTTP transactions to a concrete cluster member#4277
robfrank merged 6 commits into
mainfrom
fix/4273-connection-strategy-sticky-pins-server

Conversation

@robfrank

Copy link
Copy Markdown
Collaborator

Closes #4273

Summary

CONNECTION_STRATEGY.STICKY was a no-op: all three if-branches in RemoteHttpComponent.httpCommand() that distinguished strategies only branched on FIXED, leaving STICKY and ROUND_ROBIN behaviorally identical. In an HA deployment behind a load-balancer, the begin request could land on pod-A while subsequent command/commit requests were routed to pod-B (or to the concrete leaderServer address), causing "Remote transaction not found or expired" failures.

Changes:

  • RemoteHttpComponent: added stickyTransactionServer field + setStickyTransactionServer() setter. getUrl() returns the pinned address when strategy is STICKY and the field is set. httpCommand() uses stickyTransactionServer as the initial target and treats it like FIXED for retry-count and failover (no round-robin rotation during an active pinned transaction).

  • RemoteDatabase.begin(): before issuing the HTTP call, resolves a concrete cluster-member address (leader preferred, currentServer/currentPort as fallback) and pins it as stickyTransactionServer. The getUrl() override then routes the begin request - and all subsequent command/commit/rollback calls - to that same node. Clears the pin on failure.

  • RemoteDatabase.setSessionId(null): clears stickyTransactionServer when the session ends (commit, rollback, or exception) so the next transaction starts fresh.

Test plan

  • RemoteHttpComponentTest - 4 new unit tests:

    • stickyStrategyUrlUsesPinnedServer - URL uses pinned address when STICKY + pin set
    • stickyStrategyUrlUsesCurrentServerWhenNotPinned - STICKY without pin falls back to LB host
    • stickyStrategyUrlRevertsAfterClearingPin - clearing the pin restores the LB host URL
    • roundRobinStrategyIgnoresStickyServer - ROUND_ROBIN never uses the sticky field
  • RemoteDatabaseTest - 2 new unit tests:

    • stickyStrategyUrlUsesPinnedServerDuringActiveSession - URL uses pinned address during transaction
    • stickyStrategySessionClearClearsStickyServer - setSessionId(null) releases the pin

Run with:

cd network && mvn test -Dtest="RemoteHttpComponentTest,RemoteDatabaseTest"

🤖 Generated with Claude Code

…te cluster member

CONNECTION_STRATEGY.STICKY was functionally identical to ROUND_ROBIN: no server
pinning occurred during a transaction, causing "Remote transaction not found or
expired" errors when a load-balancer distributed the begin/command/commit requests
across different HA nodes.

Fix: RemoteDatabase.begin() now pre-selects the leader (or currentServer as
fallback) as the sticky target before the HTTP call. RemoteHttpComponent.getUrl()
and httpCommand() consult stickyTransactionServer when STICKY is active, routing
all three transaction phases to the same physical node. setSessionId(null)
releases the pin when the transaction ends.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses issue #4273 by implementing proper server pinning for the STICKY connection strategy. It introduces a stickyTransactionServer field to ensure that all phases of a remote transaction—begin, command, and commit/rollback—are routed to the same physical node, preventing 'transaction not found' errors in load-balanced environments. The changes include logic to pin the server during begin(), respect the pin in httpCommand(), and clear it upon transaction completion or failure. Feedback suggests refactoring the getUrl method in RemoteHttpComponent to eliminate duplicate URL construction logic and improve maintainability.

Comment thread network/src/main/java/com/arcadedb/remote/RemoteHttpComponent.java Outdated
@codacy-production

codacy-production Bot commented May 21, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity

Metric Results
Complexity 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@claude

claude Bot commented May 21, 2026

Copy link
Copy Markdown

Code Review - fix(#4273): STICKY strategy HTTP transaction pinning

Overview

This PR correctly identifies and fixes a real bug: CONNECTION_STRATEGY.STICKY was behaviorally identical to ROUND_ROBIN because httpCommand() only branched on FIXED. In an HA deployment behind a load-balancer, begin could land on pod-A while command/commit hit pod-B, causing "Remote transaction not found or expired" failures. The fix is well-scoped and the approach is sound.


Issues

1. Documentation file should not be committed

docs/4273-connection-strategy-sticky-pins-server.md is an internal implementation analysis - this belongs in the PR description (where it already appears) or the commit message, not as a tracked file in the repo. The existing docs/sparse-vector-storage-design.md is a lasting design reference; a bug-fix walkthrough is not the same kind of artifact. Please remove this file from the PR.

2. Asymmetric exception handling in begin()

commit() and rollback() both clear the session in a finally block, ensuring cleanup even if the exception handler itself throws. begin() uses an explicit catch to call setStickyTransactionServer(null), which diverges from that pattern. A finally-based approach is more consistent:

// More consistent with commit/rollback:
} catch (final Exception e) {
  throw new TransactionException(...);
} finally {
  if (getSessionId() == null)
    setStickyTransactionServer(null);
}

3. stickyPinned snapshot taken but field read again directly

In httpCommand(), stickyPinned is a boolean snapshot of stickyTransactionServer != null, but the field is then read again directly for the connectToServer assignment:

final boolean stickyPinned = connectionStrategy == CONNECTION_STRATEGY.STICKY && stickyTransactionServer != null;
// ...
else if (stickyPinned)
  connectToServer = stickyTransactionServer;  // field re-read - could be null if cleared concurrently

If stickyTransactionServer is cleared between the check and the read, connectToServer will be null and connectToServer.getFirst() a few lines later will NPE. Capturing the value once resolves this:

final Pair<String, Integer> stickyPin = connectionStrategy == CONNECTION_STRATEGY.STICKY ? stickyTransactionServer : null;
final boolean stickyPinned = stickyPin != null;
// ...
else if (stickyPinned)
  connectToServer = stickyPin;  // safe - snapshot captured above

4. getLeaderServer() package-private getter added solely for test access

The new getLeaderServer() method widens the API to expose internal state for test use. If this getter is not needed by non-test production code, consider testing begin() via an observable outcome (the URL used after calling begin()) rather than via white-box inspection of the leader field.


Test Coverage

The 6 new unit tests cover URL routing correctly. What is not directly tested:

  • begin() setting the sticky pin - the RemoteDatabaseTest tests call setStickyTransactionServer() directly; no test verifies that begin() itself calls it when strategy is STICKY, nor does it test the leader-vs-fallback selection logic inside begin().
  • Full lifecycle - pin survives a command() call but is released after commit()/rollback(). Currently tested indirectly via setSessionId(null), but an end-to-end unit test for begin-command-commit would make the regression guard stronger.

Positive Notes

  • The getUrl() override is clean and the STICKY-without-pin fallback preserves backward compatibility for non-HA deployments.
  • Using a boolean stickyPinned at the top of httpCommand() for retry-count and failover decisions is the right approach.
  • Adding final to the setSessionId parameter is a small but welcome consistency improvement.
  • The commit() and rollback() finally-block paths already clear the pin via the setSessionId(null) chain - no additional changes needed there.

Summary

The bug fix is correct and the implementation is solid. Three actionable items before merging:

  1. Remove the docs markdown file from the commit
  2. Fix the potential NPE double-read in httpCommand() by snapshotting stickyTransactionServer once (item 3 above)
  3. Add a test that exercises the begin() code path that sets the pin internally

Item 2 on begin() exception handling is a minor consistency/polish point, not a blocker.

…in cleanup

- RemoteHttpComponent.getUrl: extract host/port into local vars (gemini)
- RemoteHttpComponent.httpCommand: snapshot stickyTransactionServer once to
  prevent NPE if cleared between check and read (claude)
- RemoteDatabase.begin: switch from catch-based cleanup to finally-based
  cleanup so the pin is released whenever the session was not set,
  consistent with commit/rollback (claude)
- RemoteDatabase: extract resolveStickyTargetServer helper so the
  leader-vs-fallback selection can be unit tested (claude)
- RemoteDatabaseTest: 2 new tests covering the helper (fallback to
  currentServer when no leader; leader preferred when resolved)
- Drop the issue-specific tracking doc from the PR (claude)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@robfrank

Copy link
Copy Markdown
Collaborator Author

Review feedback addressed (commit 5c3b7ea)

Gemini (medium):

  • getUrl() refactored to use local host/port vars - de-duplicated URL construction.

Claude:

  • build support #1 Removed docs/4273-connection-strategy-sticky-pins-server.md from the PR.
  • Bump gremlin.version from 3.4.10 to 3.5.1 #2 begin() now uses a finally block (if (getSessionId() == null) setStickyTransactionServer(null);) for consistency with commit()/rollback().
  • Bump slf4j-simple.version from 1.7.31 to 1.7.32 #3 httpCommand() snapshots stickyTransactionServer into a local stickyPin variable, so the field is read exactly once - no NPE if the field is cleared between the boolean check and the connectToServer assignment.
  • Bump mongo-java-driver from 3.12.4 to 3.12.10 #4 Disagree, with justification. getLeaderServer() was not added solely for tests - it is called from RemoteDatabase.resolveStickyTargetServer() (production code path used by begin()). The getter is the legitimate way for a subclass in the same package to read the private leaderServer field of the parent.
  • Bump jedis from 3.6.1 to 3.6.3 #5 Extracted RemoteDatabase.resolveStickyTargetServer() and added two unit tests: resolveStickyTargetServerFallsBackToCurrentServerWhenNoLeader and resolveStickyTargetServerPrefersLeader (uses a subclass that overrides getLeaderServer()).

All 74 network module tests pass.

@claude

claude Bot commented May 21, 2026

Copy link
Copy Markdown

Code Review

Overview

This PR correctly fixes a real bug: CONNECTION_STRATEGY.STICKY was silently behaving like ROUND_ROBIN because neither httpCommand() nor getUrl() had any special handling for it. In an HA deployment behind a load-balancer, a begin request could land on pod-A while commit landed on pod-B, producing "Remote transaction not found or expired" failures. The fix introduces a stickyTransactionServer pin that is set at begin() time and cleared when the session ends.


Correctness

Two independent routing paths are both covered:

  • begin(), commit(), and rollback() bypass httpCommand() and build their URLs via getUrl(). The override in getUrl() now returns the pinned host when STICKY is active.
  • Regular command() calls flow through databaseCommand() -> httpCommand(), which also reads stickyTransactionServer.

Pin lifecycle is sound:

  • Set: begin() pins before the HTTP call.
  • Cleared on failure: the finally block clears when getSessionId() == null (i.e., the begin HTTP call did not return a session header).
  • Cleared on session end: setSessionId(null) always calls setStickyTransactionServer(null). This covers commit(), rollback(), and the transaction() wrapper exception paths.

Retry behaviour with a pinned server:
When stickyPinned is true, the code retries the same server up to sameServerErrorRetries times (like FIXED), which is the correct decision - switching to another node would fail anyway since the transaction state lives on the original node.


Issues

Minor: Missing test for the failure path of begin()

There is no test verifying that stickyTransactionServer is cleared when begin() fails with no session header produced. Given the lifecycle complexity, a test verifying that the pin is released when getSessionId() remains null after a failed begin would increase confidence in the state machine.

Minor: Double-clear when begin() receives 204 with no session header

If the server returns 204 but omits the arcadedb-session-id header, setSessionId(null) is called inside the try block (which already calls setStickyTransactionServer(null) via the setter), and then the finally block checks and does the same (harmless no-op). Not a bug, but documenting this sequence would help future readers.

Pre-existing: getLeaderAddress() does not null-check leaderServer

The public method getLeaderAddress() calls leaderServer.getFirst() without a null guard. Not introduced by this PR, but since getLeaderServer() was just exposed as a package-visible accessor, this could surface in future callers.


Design Notes

Split routing paths (acceptable trade-off):
getUrl() is used for begin/commit/rollback (raw httpClient.send() calls), while httpCommand() handles everything else. Both paths were updated independently. A cross-reference comment would help future maintainers who change routing logic not miss one of the two paths.

STICKY without an active transaction falls back to current server:
When the pin is null and strategy is STICKY, both getUrl() and httpCommand() fall through to currentServer:currentPort (the LB hostname). This is the correct default for non-transactional reads, confirmed by the stickyStrategyUrlUsesCurrentServerWhenNotPinned test.


Test Coverage

The 6 new tests cover the core invariants well:

  • URL uses pinned host when STICKY + pin set
  • URL falls back to LB host when STICKY but no pin
  • Clearing the pin restores the LB host URL
  • ROUND_ROBIN ignores the sticky field
  • Pin is active during an active session
  • setSessionId(null) clears the pin

What is not covered (acceptable for unit-test scope):

  • Full begin() + command() + commit() flow over a real cluster
  • Retry path when the sticky node becomes unreachable mid-transaction

Summary

The fix is correct and well-scoped. The code is clean, follows existing patterns, and the tests validate the core invariants. Two items worth addressing before merge:

  1. A test for the begin() failure path verifying the sticky pin is cleared when no session is established.
  2. A brief cross-reference comment between getUrl() and httpCommand() so future routing changes do not miss one of the two paths.

Everything else is minor polish.

Reviewed by Claude Sonnet 4.6 via Claude Code

@robfrank

Copy link
Copy Markdown
Collaborator Author

/gemini review

…ng comments

- RemoteDatabaseTest.stickyStrategyClearsPinWhenBeginFails: points at a closed
  port so the begin HTTP call fails fast; verifies the sticky pin is released
  by the finally block when no session is established (claude #1)
- RemoteHttpComponent.getUrl + httpCommand: brief leading comments pointing
  out that STICKY-related routing changes must update both paths (claude #2)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements session-pinning for the STICKY connection strategy, ensuring that all operations within a transaction are routed to the same cluster member. The changes include logic to resolve the leader or current server as a sticky target, manage the pin's lifecycle based on the session state, and update HTTP routing to respect the pinned server. Review feedback suggests refactoring the duplicated logic for retrieving the sticky pin into a shared helper method to improve code maintainability.

Comment on lines 215 to 216
Exception lastException = null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic to determine the active sticky pin is duplicated here and in the getUrl method. Consider extracting this into a private helper method like getStickyPin() to improve maintainability and ensure consistency across the class.

Suggested change
Exception lastException = null;
final Pair<String, Integer> stickyPin = getStickyPin();
final boolean stickyPinned = stickyPin != null;

Comment on lines +554 to +561
// Regular query/command traffic flows through httpCommand() - any STICKY routing
// change must update both paths.
protected String getUrl(final String command) {
return protocol + "://" + currentServer + ":" + currentPort + "/api/v" + apiVersion + "/" + command;
final String host;
final int port;
final Pair<String, Integer> pin = connectionStrategy == CONNECTION_STRATEGY.STICKY ? stickyTransactionServer : null;
if (pin != null) {
host = pin.getFirst();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This logic for resolving the host and port based on the sticky strategy is slightly verbose and duplicates the strategy check. Using a helper method to retrieve the pin would simplify this block.

    final Pair<String, Integer> pin = getStickyPin();
    final String host = pin != null ? pin.getFirst() : currentServer;
    final int port = pin != null ? pin.getSecond() : currentPort;

@claude

claude Bot commented May 21, 2026

Copy link
Copy Markdown

Code Review

Good fix for a real HA bug. The root cause analysis is correct and the lifecycle management (pin on begin, clear on session-end or failure) is sound. A few concerns worth addressing before merge.


Dual routing paths - maintenance risk

The most fragile part of this change is that STICKY routing logic must now be kept in sync across two independent code paths:

  • getUrl() - used by begin(), commit(), rollback()
  • httpCommand() - used by command(), query(), etc.

The comments call this out, but comments are not enforcement. If someone adds a new operation that routes through httpCommand() and forgets this invariant, or refactors begin() to use httpCommand(), the behavior silently diverges. Consider extracting a single resolveTargetAddress() method that both paths call, so there is one place to change.


Thread safety of stickyTransactionServer

RemoteHttpComponent extends RWLockContext, but reads and writes to stickyTransactionServer in getUrl() and httpCommand() are not protected by the RW lock. If the intent is that a RemoteDatabase/RemoteHttpComponent instance is always single-threaded (one transaction at a time per instance), that assumption should be documented at the class level. If instances can be shared across threads, this is a data race.


Missing lifecycle test for commit() / rollback()

The tests verify that setSessionId(null) clears the pin, and that begin() failure clears the pin. But there is no end-to-end test confirming that the actual commit() or rollback() path eventually calls setSessionId(null) and therefore releases the pin. If commit() or rollback() ever fails without clearing the session, the pin would leak. A test like stickyPinIsReleasedAfterCommit() (even mocked) would close that gap.


Port 65530 in stickyStrategyClearsPinWhenBeginFails

final TestableRemoteDatabase failDb = new TestableRemoteDatabase("127.0.0.1", 65530, "testdb", "root", "test");

Port 65530 is in the ephemeral range and could be in use during CI. A more reliable approach would be to mock the HTTP send path so the test does not depend on a specific port being closed. Alternatively, use a port known to be refused (e.g., port 1 or test infrastructure that binds and immediately closes).


Minor: multi-line Javadoc comments

CLAUDE.md asks to avoid multi-line comment blocks (one short line max). These new blocks could be trimmed:

/**
 * Resolves the concrete cluster-member address used to pin a STICKY transaction.
 * Prefers the leader (already resolved from cluster topology) over the configured
 * server hostname, which may be a load-balancer that fans out across pods.
 */
Pair<String, Integer> resolveStickyTargetServer() {

and TestableRemoteDatabaseWithLeader's class Javadoc. The method name already communicates what it does; the "why" (LB issue) is covered by the PR description and the issue link.


Small correctness note in begin()

The STICKY pin is set before the try block:

if (getConnectionStrategy() == CONNECTION_STRATEGY.STICKY)
    setStickyTransactionServer(resolveStickyTargetServer());

try {
    ...
    setSessionId(...);
} catch (...) {
    throw new TransactionException(...);
} finally {
    if (getSessionId() == null)
        setStickyTransactionServer(null);
}

This is correct - the finally block clears the pin if no session was established. Just worth noting that the setSessionId() override also calls setStickyTransactionServer(null) when passed null, so in the success path the pin is cleared twice (once when commit/rollback eventually calls setSessionId(null), and never from the finally block since getSessionId() != null at that point). No bug, but it is worth a comment confirming this double-clear is intentional and harmless.


Summary

Issue Severity
Dual routing paths require synchronized changes Medium
Thread safety of stickyTransactionServer not documented Medium
No test for commit/rollback pin lifecycle Minor
Port 65530 can be flaky in CI Minor
Multi-line Javadoc blocks (CLAUDE.md style) Minor

The core logic is correct. Addressing the dual-path routing concern (or at least extracting resolveTargetAddress()) would significantly reduce future maintenance risk.

@robfrank

Copy link
Copy Markdown
Collaborator Author

Review feedback on cycle 3 HEAD 5833835

Both bots reviewed this HEAD. Per the resolve-issue-with-review skill's max-cycles=3 limit, no further automated commits will be pushed. The remaining feedback is summarized below for developer follow-up.

Both bots converge on: extract a shared helper for sticky-pin retrieval (currently duplicated in getUrl() and httpCommand()).

  • Gemini (medium, lines 216 + 561): suggest private Pair<String, Integer> getStickyPin() helper.
  • Claude (medium): same concern phrased as "dual routing paths require synchronized changes" - recommends a single resolveTargetAddress() that both getUrl() and httpCommand() call.

Claude additional items:

  • Medium: thread-safety of stickyTransactionServer is not documented (class is RWLockContext but the field is read/written without the lock - matches the existing "not thread safe; one instance per thread" contract in the class Javadoc, worth restating)
  • Minor: no end-to-end test that commit()/rollback() actually release the pin (current tests verify setSessionId(null) releases it; the wiring through commit/rollback is implicit)
  • Minor: stickyStrategyClearsPinWhenBeginFails uses port 65530 which is in the ephemeral range and could conflict in CI
  • Minor: multi-line Javadoc blocks on resolveStickyTargetServer and TestableRemoteDatabaseWithLeader exceed CLAUDE.md's one-short-line rule

The core fix is sound (both reviews agree). These are quality-of-implementation refinements, not correctness issues. Merge decision is yours.

@codecov

codecov Bot commented May 21, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 77.27273% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.46%. Comparing base (6918763) to head (33f6e71).

Files with missing lines Patch % Lines
.../java/com/arcadedb/remote/RemoteHttpComponent.java 64.28% 2 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4277      +/-   ##
============================================
- Coverage     64.69%   64.46%   -0.24%     
- Complexity        0      423     +423     
============================================
  Files          1645     1645              
  Lines        127456   127474      +18     
  Branches      27316    27324       +8     
============================================
- Hits          82458    82173     -285     
- Misses        33448    33686     +238     
- Partials      11550    11615      +65     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

RemoteStickyConnectionStrategyIT exercises a 3-node Raft cluster end-to-end
with RemoteDatabase pointed at a follower's HTTP port:

- STICKY allows a write transaction (begin/command/commit) to succeed when
  the client is connected to a follower - all three phases pin to the leader
- ROUND_ROBIN smoke check (logs the racy 'transaction not found' outcome)
- Pin is released after commit so a second STICKY transaction repins fresh
- Pin is released after rollback so a subsequent STICKY transaction succeeds

Class is tagged @tag("slow"); runs in ~53s.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@claude

claude Bot commented May 21, 2026

Copy link
Copy Markdown

Code Review - Final Cycle

This is the third automated review pass. The first two cycles identified and the author addressed: the docs file removal, the finally-block pattern in begin(), the stickyPin snapshot in httpCommand(), and the extraction of resolveStickyTargetServer() with dedicated tests. The core logic is correct.

Remaining open items (unchanged from previous cycles)

1. Duplicate sticky-pin retrieval logic (medium)

Both getUrl() and httpCommand() contain the same inline expression to read the sticky pin. Both previous reviewers (Gemini and Claude) flagged this. Extracting a single getStickyPin() helper eliminates the duplication and ensures a future routing change only needs one edit site. The cross-reference comments are a good mitigation but not a substitute.

2. Port 65530 in stickyStrategyClearsPinWhenBeginFails (minor)

Port 65530 is in the Linux ephemeral range (49152-65535) and can be allocated to another process during CI, making the test non-deterministic. A reserved port (e.g. 1) or binding-then-closing a ServerSocket(0) would be more reliable.

3. Multi-line Javadoc on resolveStickyTargetServer() (minor)

Per CLAUDE.md: one short line max. The method name is self-describing; the load-balancer rationale belongs in the PR description.

4. Thread-safety contract not documented (minor)

stickyTransactionServer is read/written without the RWLockContext lock that protects other fields. If RemoteDatabase is single-threaded-per-instance (the likely intent), stating that at the class level prevents a future reader from adding unnecessary locking or assuming it is safe to share across threads.


Summary

The bug fix is correct, the lifecycle state machine (pin on begin, clear on failure or session-end) is sound, and the test coverage for the pinning invariants is good. Items 1 and 2 are worth addressing before merge; items 3 and 4 can be polish in a follow-on commit.

Reviewed by Claude Sonnet 4.6 via Claude Code

…en comment

- RemoteHttpComponent: extract private getStickyPin() and use it from both
  getUrl() and httpCommand() (gemini #1 / claude #1 - duplicate logic)
- RemoteDatabaseTest.stickyStrategyClearsPinWhenBeginFails: replace hardcoded
  port 65530 (in IANA ephemeral range) with ServerSocket(0)+close so the
  port is OS-assigned and deterministically closed (claude #2)
- RemoteDatabase.resolveStickyTargetServer: collapse 3-line Javadoc into a
  single-line // comment per CLAUDE.md (claude #3)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@robfrank

Copy link
Copy Markdown
Collaborator Author

Review response (commit ee62d145e)

Item 1 - duplicate sticky-pin retrieval: Fixed. Extracted RemoteHttpComponent.getStickyPin(); both getUrl() and httpCommand() now call it.

Item 2 - port 65530 ephemeral range: Fixed. The test now binds a ServerSocket(0) to get an OS-assigned port, closes it, and uses that port - no hardcoded value in any range.

Item 3 - multi-line Javadoc on resolveStickyTargetServer: Fixed. Collapsed to a one-line // comment per CLAUDE.md. (The TestableRemoteDatabaseWithLeader Javadoc is left as-is because its single line of content already matches the pre-existing TestableRemoteDatabase Javadoc pattern in the same file.)

Item 4 - thread-safety contract not documented: Pushback. The class-level Javadoc of RemoteHttpComponent already states the contract:

Remote Database implementation. It's not thread safe. For multi-thread usage create one instance of RemoteDatabase per thread.

stickyTransactionServer is governed by that same single-thread-per-instance contract as every other mutable field (sessionId, currentServer, currentPort, leaderServer, currentReplicaServerIndex). Restating the contract on a single field would imply the other fields have different semantics, which they do not. The RWLockContext superclass is not used by this implementation for field-level synchronization - it is available to subclasses that need it.

All 75 unit tests still pass.

@claude

claude Bot commented May 21, 2026

Copy link
Copy Markdown

Code Review - fix(#4273): STICKY strategy HTTP transaction pinning (final pass)

Three previous review cycles have already been completed. The author addressed all raised items in commit `ee62d145e`. This pass reviews the current diff as-is.


What the PR does

Before this fix, `CONNECTION_STRATEGY.STICKY` was a dead branch: `httpCommand()` only branched on `FIXED`, leaving STICKY and ROUND_ROBIN identical. In an HA deployment behind a load-balancer, `begin` could land on pod-A while `command`/`commit` hit pod-B, producing "Remote transaction not found or expired" errors.

The fix adds a `stickyTransactionServer` pin that is set in `begin()`, honoured by both `getUrl()` and `httpCommand()`, and cleared on session end or begin failure.


Verification of previously raised items

Item Status
Docs file removal Not present in diff - resolved
`finally` block in `begin()` Correctly implemented: clears pin only when `getSessionId() == null`
`getStickyPin()` extracted (no duplicate logic) Done - private helper called by both `getUrl()` and `httpCommand()`
`ServerSocket(0)` in failure test Done - OS-assigned port, immediately closed
Single-line comment on `resolveStickyTargetServer()` Done
Thread-safety contract pushback Accepted - class-level Javadoc already states single-thread-per-instance

Remaining observations

1. stickyPin snapshot in httpCommand() is a good subtle safeguard

Taking final Pair<String, Integer> stickyPin = getStickyPin() once at method entry before the retry loop ensures the pin cannot change mid-retry even if a concurrent caller somehow races with `setSessionId(null)`. This is correct behaviour even under the documented single-thread contract, and it is good defensive practice.

2. roundRobinSmokeCheckExecutesAgainstFollower swallows all exceptions

The test catches and logs any exception from the ROUND_ROBIN transaction without asserting anything. This means a connection-refused or NPE would also pass silently. Given the documented intent (ROUND_ROBIN failures are racily acceptable), this is acceptable as a negative-control check, but worth noting it will not catch regressions unrelated to routing.

3. Codecov reports 62.5% patch coverage for RemoteHttpComponent.java

The uncovered lines are in the sticky-pin failover retry path (when the pinned server is unreachable). This path is hard to exercise without a real cluster and is excluded from the unit-test scope. The IT test `RemoteStickyConnectionStrategyIT` covers the live-cluster path. Acceptable.


Overall assessment

The bug fix is correct. The lifecycle state machine (pin on begin, clear on failure or session-end) is sound. The `getStickyPin()` helper eliminates the previously flagged duplication. The `ServerSocket(0)` approach in the failure test is portable and deterministic. The IT test covers all four documented scenarios against a real 3-node Raft cluster.

No blocking issues remain. Ready to merge pending maintainer sign-off.

Reviewed by Claude Sonnet 4.6 via Claude Code

Comment on lines +215 to +242
needs: build-and-package
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

- name: Set up JDK 21
uses: actions/setup-java@be666c2fcd27ec809703dec50e508c2fdc7f6654 # v5.2.0
with:
distribution: "temurin"
java-version: 21
cache: "maven"

- name: Restore Maven artifacts
uses: actions/cache/restore@27d5ce7f107fe9357f9df03efb73ab90386fccae # v5.0.5
with:
path: ~/.m2/repository
key: maven-repo-${{ github.run_id }}-${{ github.run_attempt }}

- name: Run HA Integration Tests with Coverage
run: ./mvnw verify -DskipTests -Pintegration -Pcoverage --batch-mode --errors --fail-never --show-version -pl ha-raft
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

- name: HA IT Tests Reporter
uses: dorny/test-reporter@a43b3a5f7366b97d083190328d2c652e1a8b6aa2 # v3.0.0
if: success() || failure()
with:
name: HA IT Tests Report
path: "**/failsafe-reports/TEST*.xml"
@codacy-production

codacy-production Bot commented May 21, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity

Metric Results
Complexity 0

View in Codacy

🟢 Coverage 90.91% diff coverage · -7.62% coverage variation

Metric Results
Coverage variation -7.62% coverage variation
Diff coverage 90.91% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (6918763) 127456 94037 73.78%
Head commit (33f6e71) 159152 (+31696) 105288 (+11251) 66.16% (-7.62%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#4277) 22 20 90.91%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@claude

claude Bot commented May 21, 2026

Copy link
Copy Markdown

Code Review - STICKY Strategy Fix (#4273)

Overview

This PR correctly identifies and fixes a long-standing no-op: all three if-branches in httpCommand() only distinguished FIXED from the rest, leaving STICKY and ROUND_ROBIN behaviorally identical. The fix is well-scoped and the test coverage is solid.


Correctness

Pin lifecycle - one subtle edge case in begin():

The finally block clears the pin when getSessionId() == null:

} finally {
    if (getSessionId() == null)
        setStickyTransactionServer(null);
}

This works correctly when the HTTP call fails outright. However, if the server returns HTTP 204 but omits the ArcadeDB-SessionId header, setSessionId(null) is invoked, which clears both the session and the sticky pin via the new coupling in setSessionId(). A subsequent command() call would then fall back to the load-balancer address for an active-but-untracked transaction. This is arguably a pre-existing server-side bug (returning 204 without a session ID is malformed), but it is worth a comment or an assertion.

Double-clear is harmless but worth acknowledging:

When a transaction ends normally, setSessionId(null) (called from commit()/rollback()) already clears the pin. The finally branch in begin() also clears it on failure. These two paths are independent and correct - the redundant clear on a null pin is a no-op. No action needed, just noting it is intentional.


Thread Safety

stickyTransactionServer is a plain (non-volatile) reference on a class that extends RWLockContext:

private Pair<String, Integer> stickyTransactionServer;

The existing leaderServer field has the same pattern. If two threads share a RemoteDatabase instance and one reads getUrl() (via getStickyPin()) concurrently with the other calling setStickyTransactionServer(), visibility is not guaranteed. RemoteDatabase is likely not designed for cross-thread transaction sharing, but making this volatile would be cheap insurance consistent with RWLockContext's purpose. Consider:

private volatile Pair<String, Integer> stickyTransactionServer;

Minor Code Quality Points

Two routing paths must stay in sync (documented, but fragile):

The comments on httpCommand() and getUrl() note that both paths must be updated together for any STICKY routing change. This cross-path coupling is the main long-term maintenance risk. Acceptable for now, but worth tracking as technical debt.

resolveStickyTargetServer() is package-private for testability:

This is reasonable. The TestableRemoteDatabaseWithLeader subclass pattern used in tests is clean and avoids mockito-style mocking overhead.

Comments on begin() explain the WHY - consistent with project style.


Test Coverage

Coverage is thorough:

  • RemoteHttpComponentTest - 4 unit tests cover getUrl() routing for pinned / unpinned / cleared / ROUND_ROBIN cases.
  • RemoteDatabaseTest - 5 unit tests cover session/pin lifecycle, fallback, leader preference, and failure cleanup. The stickyStrategyClearsPinWhenBeginFails test using a closed ServerSocket is a clean technique.
  • RemoteStickyConnectionStrategyIT - 4 end-to-end tests against a real 3-node Raft cluster cover write success, ROUND_ROBIN smoke, back-to-back STICKY after commit, and STICKY after rollback.

One suggestion: the ROUND_ROBIN smoke test swallows any exception:

} catch (final Exception e) {
    LogManager.instance().log(this, Level.INFO, "TEST: ROUND_ROBIN smoke run raised %s", e.getMessage());
}

This means the test always passes regardless of outcome. A comment explaining this is intentional (verifying the path does not throw unexpectedly other than the known transaction-not-found error) would help future readers. Alternatively, asserting that any thrown exception is an instance of TransactionException would tighten it without making it flaky.


CI Changes

Splitting ha-raft into its own job is the right call - HA tests are slow and their failure mode is different from unit tests. The --fail-never flag is carried over from the existing integration-tests job for consistency. The artifact name ha-integration-coverage-reports feeds the existing coverage aggregation step correctly.


Summary

The fix is correct and the approach is well-designed. Two items worth a follow-up before merge:

  1. volatile on stickyTransactionServer - low-cost, improves correctness under concurrent reads.
  2. ROUND_ROBIN smoke test - tighten the assertion or add a comment so the intent is clear from the test body alone.

Everything else is solid. Good regression test coverage and a clean CI split for HA tests.


Review generated with Claude Code

@robfrank robfrank merged commit 26eeb40 into main May 21, 2026
27 of 32 checks passed
@robfrank robfrank deleted the fix/4273-connection-strategy-sticky-pins-server branch May 21, 2026 16:00
@robfrank robfrank added this to the 26.6.1 milestone May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HTTP RemoteDatabase loses transaction session under HA / load-balanced deployments — STICKY and ROUND_ROBIN strategies are functionally identical

2 participants