
HTTP RemoteDatabase loses transaction session under HA / load-balanced deployments — STICKY and ROUND_ROBIN strategies are functionally identical #4273

@ruispereira


Affected version: arcadedb-network 26.4.2 (also present in 26.3.1, 25.x — based on bytecode inspection)
Component: com.arcadedb.remote.RemoteHttpComponent, com.arcadedb.remote.RemoteDatabase

Summary

In an HA deployment (3 ArcadeDB nodes) accessed via the HTTP REST API, write transactions issued through RemoteDatabase.transaction(...) intermittently fail on the server side with:

INFO [PostCommandHandler] Error on transaction execution (PostCommandHandler): Remote transaction 'AS-' not found or expired

2026-05-21 07:38:13.930 INFO  [PostCommandHandler] <vcc-superx-arcadedb-0> Error on transaction execution (PostCommandHandler): Remote transaction 'AS-adc28f3c-4302-40e7-8f04-ef3cf12b66bf' not found or expired
2026-05-21 07:38:15.043 INFO  [PostCommandHandler] <vcc-superx-arcadedb-0> Error on transaction execution (PostCommandHandler): Remote transaction 'AS-eb38e1c7-9262-4281-8f67-42ebfb752b9a' not found or expired
2026-05-21 07:42:11.715 INFO  [PostCommandHandler] <vcc-superx-arcadedb-0> Error on transaction execution (PostCommandHandler): Remote transaction 'AS-aff113a3-8c69-4c5e-a2a3-4be9aa9bf01f' not found or expired
2026-05-21 07:42:12.816 INFO  [PostCommandHandler] <vcc-superx-arcadedb-0> Error on transaction execution (PostCommandHandler): Remote transaction 'AS-98bf440b-bd94-4766-8b52-373fa1b9fb64' not found or expired
2026-05-21 07:43:16.367 INFO  [PostCommandHandler] <vcc-superx-arcadedb-0> Error on transaction execution (PostCommandHandler): Remote transaction 'AS-19a6f235-ba29-4c72-bc3a-7d11c4358be3' not found or expired
2026-05-21 07:43:17.478 INFO  [PostCommandHandler] <vcc-superx-arcadedb-0> Error on transaction execution (PostCommandHandler): Remote transaction 'AS-61fa3df1-2741-454a-b490-36a9e63c5995' not found or expired
2026-05-21 07:47:38.190 INFO  [PostCommandHandler] <vcc-superx-arcadedb-0> Error on transaction execution (PostCommandHandler): Remote transaction 'AS-cc9ddc0d-7827-44dd-a461-ea253adde68c' not found or expired

The root cause is that the three HTTP requests that make up a remote transaction (begin → command → commit) carry a session id (arcadedb-session-id) that is bound to the node that handled begin, but later requests in the same logical transaction may be routed to a different node, and session state is not shared across the HA cluster. The receiving node has no record of the session and rejects it.
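
To make the failure mode concrete, here is a minimal in-memory model, assuming (as described above) that each node keeps its own session map and nothing is replicated. ModelNode and SessionLossDemo are hypothetical names, not ArcadeDB server classes; the error string only mirrors the log line.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.function.IntUnaryOperator;

// Hypothetical model of one HA node: it keeps its OWN map of open HTTP
// transaction sessions, and nothing is shared with other nodes.
final class ModelNode {
  private final Map<String, StringBuilder> sessions = new HashMap<>();

  // "begin": create a session id that only this node knows about.
  String begin() {
    String id = "AS-" + UUID.randomUUID();
    sessions.put(id, new StringBuilder());
    return id;
  }

  // "command": the session must exist locally, otherwise fail like the server log.
  String command(String sessionId, String sql) {
    StringBuilder tx = sessions.get(sessionId);
    if (tx == null) return notFound(sessionId);
    tx.append(sql).append(';');
    return "ok";
  }

  // "commit": close the session; an unknown id fails the same way.
  String commit(String sessionId) {
    return sessions.remove(sessionId) == null ? notFound(sessionId) : "ok";
  }

  private static String notFound(String sessionId) {
    return "Remote transaction '" + sessionId + "' not found or expired";
  }
}

final class SessionLossDemo {
  // Runs begin -> command -> commit; `route` maps the request index
  // (0 = begin, 1 = command, 2 = commit) to the node that receives it.
  static String run(ModelNode[] nodes, IntUnaryOperator route) {
    String id = nodes[route.applyAsInt(0)].begin();
    String result = nodes[route.applyAsInt(1)].command(id, "UPDATE ...");
    if (!result.equals("ok")) return result;
    return nodes[route.applyAsInt(2)].commit(id);
  }

  public static void main(String[] args) {
    ModelNode[] nodes = { new ModelNode(), new ModelNode(), new ModelNode() };
    System.out.println(run(nodes, i -> 0));                // all requests on node 0: "ok"
    System.out.println(run(nodes, i -> i % nodes.length)); // spread across nodes: session not found
  }
}
```

The only difference between the two runs is where requests land; the server-side behaviour is identical.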

The client-side mitigation that should exist for this — CONNECTION_STRATEGY.STICKY — does not actually pin requests to a single server in the current implementation.

Reproduction

  1. Run an ArcadeDB HA cluster (3 nodes) behind any load balancer that distributes connections (Kubernetes Service, HAProxy, nginx, etc.).
  2. Configure a Java client:
    RemoteDatabase db = new RemoteDatabase(host, 2480, "mydb", user, pass);
    db.setConnectionStrategy(RemoteHttpComponent.CONNECTION_STRATEGY.STICKY);
    db.transaction(() -> db.command("sql", "UPDATE …"), false, 5);
  3. Under moderate concurrency, server logs show Remote transaction 'AS-…' not found or expired on one or more nodes.

Root cause — CONNECTION_STRATEGY.STICKY is a no-op

Inspecting the code base of RemoteHttpComponent (26.4.2), the only places that branch on connectionStrategy are:

  1. Number of retries (httpCommand):
    retries = (strategy == FIXED) ? sameServerErrorRetries : haServerErrorRetries;
  2. Initial target selection (httpCommand):
    target = (strategy == FIXED)
    ? new Pair(originalServer, originalPort)
    : (leaderRequired && leaderServer != null
    ? leaderServer
    : new Pair(currentServer, currentPort));
  3. Error/failover path (httpCommand):
    if (strategy == FIXED) { log warning; retry same server; }
    else { reloadClusterConfiguration(); currentServer = nextReplica(); }

In none of these branches is STICKY distinguished from ROUND_ROBIN. Both fall into the same else paths. The only other reference to STICKY in the class is a FINE-level log line ("Remote Database configured with leader=… replicas=… strategy=%s"). The full grep result over arcadedb-network-26.4.2.jar:

RemoteHttpComponent.class : field + getter/setter + 3 == FIXED branches + 1 log
RemoteHttpComponent$CONNECTION_STRATEGY : enum constants {STICKY, ROUND_ROBIN, FIXED}

(arcadedb-engine contains zero references.)

So setConnectionStrategy(STICKY) has no behavioural effect vs the default ROUND_ROBIN.
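
The three branches listed above can be reduced to a pure function to show the consequence. This is a sketch that assumes those branches are the only strategy-dependent logic, as the grep suggests; Strategy, Decision and StrategyModel are hypothetical stand-ins, not the actual RemoteHttpComponent source.

```java
// Hypothetical stand-in for RemoteHttpComponent.CONNECTION_STRATEGY.
enum Strategy { STICKY, ROUND_ROBIN, FIXED }

// What the client does for one request: retries, first target, target after an error.
record Decision(int retries, String initialTarget, String afterError) {}

final class StrategyModel {
  static Decision decide(Strategy strategy, boolean leaderRequired,
                         String originalServer, String leaderServer,
                         String currentServer, String nextReplica,
                         int sameServerErrorRetries, int haServerErrorRetries) {
    // 1. Retries: only FIXED is special-cased.
    int retries = strategy == Strategy.FIXED ? sameServerErrorRetries : haServerErrorRetries;

    // 2. Initial target: FIXED uses the original server; every other strategy
    //    evaluates the same leader-or-current expression.
    String initialTarget = strategy == Strategy.FIXED
        ? originalServer
        : (leaderRequired && leaderServer != null ? leaderServer : currentServer);

    // 3. Error path: FIXED retries the same server; every other strategy moves
    //    on to the next replica (after reloading the cluster configuration).
    String afterError = strategy == Strategy.FIXED ? originalServer : nextReplica;

    return new Decision(retries, initialTarget, afterError);
  }
}
```

Because strategy is only ever compared with FIXED, a STICKY call and a ROUND_ROBIN call with the same inputs always return equal Decision records.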

Why this causes the transaction loss

Even though currentServer is not explicitly rotated on success (so two consecutive requests on the same RemoteDatabase instance should target the same hostname), in practice the requests still reach different physical nodes whenever the configured hostname resolves to a load-balanced endpoint (the common case in Kubernetes and cloud deployments). The JDK HttpClient may open new TCP connections at any time (HTTP/1.1 keep-alive timeout, HTTP/2 GOAWAY, idle eviction) and, under concurrency, may hold several connections open at once. The LB balances each new connection independently, so it can land on a different backend pod, where the session id is unknown.
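
A small model of the connection-level behaviour described above, assuming an L4 balancer that picks a backend once per TCP connection (round-robin here for determinism; real balancers may pick randomly). ConnectionBalancer and ClientModel are hypothetical names.

```java
import java.util.List;

// Hypothetical L4 load balancer: chooses a backend once per new TCP connection.
final class ConnectionBalancer {
  private final List<String> backends;
  private int next = 0;

  ConnectionBalancer(List<String> backends) { this.backends = List.copyOf(backends); }

  // Called when the client opens a new connection to the configured hostname.
  String connect() {
    String backend = backends.get(next);
    next = (next + 1) % backends.size();
    return backend;
  }
}

// Hypothetical client that always targets the same hostname (like currentServer).
final class ClientModel {
  private final ConnectionBalancer lb;
  private String connectionBackend; // backend behind the current connection, or null

  ClientModel(ConnectionBalancer lb) { this.lb = lb; }

  // Sends one request; returns the backend that actually served it.
  String send() {
    if (connectionBackend == null) connectionBackend = lb.connect();
    return connectionBackend;
  }

  // Simulates the connection going away (keep-alive timeout, GOAWAY, idle eviction).
  void dropConnection() { connectionBackend = null; }
}
```

While the connection survives, begin and command reach the same pod; once it is dropped, the next request (say, commit) opens a fresh connection and may be balanced to another pod, even though the client-side hostname never changed.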

A working STICKY implementation should, after the first response, pin the client to the actual backend that handled begin — for example by:

  • using the discovered pod-specific address (leaderAddress / replicaAddresses) instead of the configured load-balancer hostname for subsequent calls within the same RemoteDatabase instance, and/or
  • consistently reusing the same underlying connection for the whole transaction lifetime, and/or
  • exposing the backing node id in the response and resolving subsequent calls to that node directly.
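
The first option could look roughly like the following sketch, assuming the client already receives the leader and replica addresses from the cluster configuration. StickyTarget is a hypothetical helper, not part of arcadedb-network, and pinning to the leader first is one possible choice, not a requirement.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical target selector that pins to a concrete cluster member
// instead of the user-provided (possibly load-balanced) hostname.
final class StickyTarget {
  private final String configuredHost;
  private List<String> members = List.of();
  private String pinned; // concrete member used for all calls once known

  StickyTarget(String configuredHost) { this.configuredHost = configuredHost; }

  // Record the topology (leader first, replicas deduplicated) and pin once.
  synchronized void onClusterConfiguration(String leaderAddress, List<String> replicaAddresses) {
    List<String> all = new ArrayList<>();
    if (leaderAddress != null) all.add(leaderAddress);
    for (String replica : replicaAddresses)
      if (!all.contains(replica)) all.add(replica);
    members = List.copyOf(all);
    // A topology refresh does not move the pin unless the pinned member has left.
    if ((pinned == null || !members.contains(pinned)) && !members.isEmpty())
      pinned = members.get(0);
  }

  // Before the topology is known, fall back to the configured hostname.
  synchronized String target() { return pinned != null ? pinned : configuredHost; }

  // Failover happens only here, on a connection error: move to the next member.
  synchronized void onConnectionError() {
    if (members.isEmpty()) return;
    int i = members.indexOf(pinned);
    pinned = members.get((i + 1) % members.size());
  }
}
```

The key property is that nothing but a connection error changes the target, so every request from the instance, including all three steps of a transaction, reaches the same pod.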

Suggested fix

  1. Make STICKY actually sticky. After the first successful response (or after requestClusterConfiguration()), pin currentServer/currentPort to a concrete cluster member (not the user-provided LB hostname) for the lifetime of the RemoteDatabase instance. Failover (move to next replica) should only occur on connection error, not opportunistically.
  2. Or, ideally, scope stickiness to the transaction. Once begin returns a session id, capture the node that issued it and route all subsequent calls bearing that arcadedb-session-id to the same node until commit/rollback.
  3. Document clearly which deployment topologies require which strategy. Today the names (STICKY, ROUND_ROBIN) imply a difference that the code does not deliver.
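
Option 2 (transaction-scoped stickiness) amounts to a session-id-to-node map consulted before the normal strategy. A minimal sketch, assuming the client knows which node answered begin; SessionAffinity is a hypothetical helper and only the arcadedb-session-id concept comes from the issue.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical router: remembers which node issued each session id and sends
// every request carrying that id to the same node until commit/rollback.
final class SessionAffinity {
  private final Map<String, String> nodeBySession = new ConcurrentHashMap<>();

  // After "begin": bind the returned session id to the node that answered.
  void onBegin(String sessionId, String node) { nodeBySession.put(sessionId, node); }

  // Bound node for a known session id; otherwise the normal strategy's choice.
  String route(String sessionIdOrNull, Supplier<String> defaultChoice) {
    if (sessionIdOrNull != null) {
      String node = nodeBySession.get(sessionIdOrNull);
      if (node != null) return node;
    }
    return defaultChoice.get();
  }

  // After commit or rollback the binding is dropped.
  void onEnd(String sessionId) { nodeBySession.remove(sessionId); }
}
```

Requests outside a transaction keep using whatever the configured strategy chooses; only requests inside a transaction are pinned, and only for as long as that transaction is open.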

Environment

  • ArcadeDB cluster: 3 nodes (HA) on Kubernetes (StatefulSet + headless Service)
  • Server reports cluster topology correctly: getLeaderAddress() = pod-0…:2480, getReplicaAddresses() = [pod-0…:2480, pod-1…:2480] (separate issue: only 2 of 3 nodes were reported in our cluster — under investigation on our side)
  • Java client: arcadedb-network 26.4.2, JDK 21, default HttpClient
