Skip to content

feat: [#4727] HA auto-acquire of databases a joining node has never seen#4739

Merged
lvca merged 10 commits into
mainfrom
worktree-issue-4727-ha-auto-acquire
Jun 26, 2026
Merged

feat: [#4727] HA auto-acquire of databases a joining node has never seen#4739
lvca merged 10 commits into
mainfrom
worktree-issue-4727-ha-auto-acquire

Conversation

@lvca

@lvca lvca commented Jun 25, 2026

Copy link
Copy Markdown
Member

Summary

Fixes #4727. A node joining a Raft HA cluster with an empty data directory never acquired databases it had not previously seen on disk - it only refreshed databases already present, so runtime-created/imported databases were silently missing on new nodes (the #4522 StatefulSet scale-up case: new pods came up with only the defaultDatabases entry).

Now a joining node reconciles its local database set against the leader's and pulls (full snapshot install) any database it is missing.

Root cause

ArcadeStateMachine.notifyInstallSnapshotFromLeader iterated server.getDatabaseNames() (local DBs only), so a brand-new node had nothing to install. There was no step that listed the leader's databases and pulled the missing ones.

What changed

Core (engine / ha-raft)

  • New global setting arcadedb.ha.autoAcquireDatabases (default true), per-node, re-read each boot. It gates the join-time pull only; the create-time push via INSTALL_DATABASE_ENTRY stays mandatory (core Raft consensus).
  • notifyInstallSnapshotFromLeader now enumerates the leader's databases (reusing the existing POST /api/v1/cluster/bootstrap-state RPC via the new LeaderDatabaseQuery) and installs the union of leader + local DBs - refreshing existing ones, acquiring never-seen ones.
  • Additive-only guard: a database present locally but absent on the leader is never dropped (keeps a real DROP_DATABASE_ENTRY distinguishable from "the leader never had it"); it is flagged LEADER_MISSING.
  • SnapshotInstaller.acquireNewDatabase: crash-safe acquisition of a never-seen database. The download is staged under a reserved databases/.acquire-<name>/ directory (skipped by the startup scan) and published with a single atomic rename, so a crash mid-download can never leave a half-written databases/<name>/ that loadDatabases would try to open. Startup recovery cleans leftover .acquire-* staging dirs.

Observability

  • Per-database acquireStatus (ACQUIRING / ACQUIRED / LEADER_MISSING / FAILED) in GET /api/v1/cluster.
  • A leader-missing-databases cluster alert (the How to replicate databases when new cluster nodes are added in Kubernetes ? #4522 aggravating factor: a node lacking a DB is elected leader).
  • Optional GET /api/v1/cluster?presence=true per-node x per-database presence fan-out (gated so the auto-poll stays cheap).

Studio Cluster tab

  • Transfer leadership (peer picker), Step down, per-database Verify-consistency buttons, and a database presence matrix.

Tests

  • SnapshotAcquireStagingRecoveryTest (unit) - crash recovery of the reserved acquisition staging dir; verifies no phantom non-reserved database directory is left behind.
  • SnapshotAcquireNewDatabaseIT - drives acquireNewDatabase end-to-end against a live leader (HTTP snapshot download → atomic publish → register), asserting the DB is acquired with all records and no leftover staging/marker files.
  • Full ha-raft suite: 355 tests, 0 failures.

Note: full-cluster restart ITs were intentionally not used to assert this path - in-process Ratis catches a restarted node up via log replay rather than InstallSnapshot, so such a test would pass even without the fix. The direct acquireNewDatabase IT exercises the genuinely new code deterministically.

🤖 Generated with Claude Code

A node joining a Raft HA cluster with an empty data directory now reconciles its
local database set against the leader's and pulls (full snapshot install) any
database it is missing, instead of only refreshing databases already on disk.
Fixes the #4522 StatefulSet scale-up case where new pods only had the
defaultDatabases entry.

- New arcadedb.ha.autoAcquireDatabases (default true) gates the join-time pull;
  create-time push via INSTALL_DATABASE_ENTRY stays mandatory (core consensus).
- notifyInstallSnapshotFromLeader enumerates the leader's databases (reusing the
  bootstrap-state RPC) and installs the union; additive-only, never drops a
  database the leader lacks (flagged LEADER_MISSING).
- SnapshotInstaller.acquireNewDatabase: crash-safe never-seen acquisition staged
  under reserved databases/.acquire-<name>/ and published with an atomic rename,
  so a crash can never leave a half-written database the startup scan would open.
- Observability: per-db acquireStatus in /api/v1/cluster, leader-missing alert,
  optional ?presence=true per-node/per-db presence fan-out.
- Studio Cluster tab: transfer leadership, step down, verify consistency, and a
  per-database x per-node presence matrix.

Tests: SnapshotAcquireStagingRecoveryTest (crash recovery of the staging dir),
SnapshotAcquireNewDatabaseIT (end-to-end acquire from a live leader).
@lvca lvca self-assigned this Jun 25, 2026
@lvca lvca added this to the 26.7.1 milestone Jun 25, 2026
@mergify

mergify Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Tick the box to add this pull request to the merge queue (same as @mergifyio queue).

  • Queue this pull request

@codacy-production

codacy-production Bot commented Jun 25, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics -136 complexity

Metric Results
Complexity -136

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements crash-safe auto-acquisition of unseen databases for joining cluster nodes (issue #4727). It introduces a reserved staging directory ('.acquire-') with atomic renames to prevent corrupt databases from being loaded on startup, exposes a database presence matrix and leadership controls in the Studio UI, and adds comprehensive integration and recovery tests. The review feedback highlights a critical bug in SnapshotInstaller where a failed database open prevents proper handle closure and cleanup on Windows due to file locking. Additionally, the feedback identifies two instances in ArcadeStateMachine and GetClusterHandler where catching general exceptions swallows InterruptedException without restoring the thread's interrupted status.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +341 to +346
try {
if (server.existsDatabase(databaseName))
((DatabaseInternal) server.getDatabase(databaseName)).getEmbedded().close();
} catch (final Exception ignore) {
// best-effort close before removal
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In the catch (final RuntimeException openEx) block, calling server.getDatabase(databaseName) will attempt to load/open the database again. Since the database just failed to open, this call will re-throw the exact same RuntimeException, which is caught and ignored by the inner catch (final Exception ignore). As a result, db.getEmbedded().close() is never executed.

On Windows, failing to close the database handles will prevent deleteDirectoryIfExists(dbPath) from succeeding due to file locking. This leaves a corrupt/partial database directory under the non-reserved name databases/<databaseName>/. On the next startup, ArcadeDBServer.loadDatabases will attempt to open this corrupt directory and crash the server, defeating the crash-safety design of acquireNewDatabase.

Using server.getDatabase(databaseName, false, false) avoids re-triggering the load/open exception and allows retrieving the registered instance safely to close it.

Suggested change
try {
if (server.existsDatabase(databaseName))
((DatabaseInternal) server.getDatabase(databaseName)).getEmbedded().close();
} catch (final Exception ignore) {
// best-effort close before removal
}
try {
if (server.existsDatabase(databaseName)) {
final DatabaseInternal db = (DatabaseInternal) server.getDatabase(databaseName, false, false);
if (db != null)
db.getEmbedded().close();
}
} catch (final Exception ignore) {
// best-effort close before removal
}

Comment on lines +636 to +642
} catch (final Exception e) {
LogManager.instance().log(this, Level.WARNING,
"Could not list the leader's databases for auto-acquire (%s); refreshing only the databases already "
+ "present locally. Missing databases will be retried on the next reconcile.", e.getMessage());
refreshExistingDatabases(leaderHttpAddr, leaderHttpsAddr, clusterToken);
return;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Swallowing InterruptedException without restoring the interrupted status of the thread is a bad practice in Java concurrency. When InterruptedException is caught, the thread's interrupted status is cleared. If it is not restored by calling Thread.currentThread().interrupt(), upstream callers or the thread pool executor may not realize the thread was interrupted, potentially preventing a clean shutdown.

Suggested change
} catch (final Exception e) {
LogManager.instance().log(this, Level.WARNING,
"Could not list the leader's databases for auto-acquire (%s); refreshing only the databases already "
+ "present locally. Missing databases will be retried on the next reconcile.", e.getMessage());
refreshExistingDatabases(leaderHttpAddr, leaderHttpsAddr, clusterToken);
return;
}
} catch (final InterruptedException e) {
Thread.currentThread().interrupt();
LogManager.instance().log(this, Level.WARNING,
"Database reconciliation interrupted; refreshing only the databases already present locally.");
refreshExistingDatabases(leaderHttpAddr, leaderHttpsAddr, clusterToken);
return;
} catch (final Exception e) {
LogManager.instance().log(this, Level.WARNING,
"Could not list the leader's databases for auto-acquire (%s); refreshing only the databases already "
+ "present locally. Missing databases will be retried on the next reconcile.", e.getMessage());
refreshExistingDatabases(leaderHttpAddr, leaderHttpsAddr, clusterToken);
return;
}

Comment on lines +183 to +188
} catch (final Exception e) {
LogManager.instance().log(this, Level.WARNING,
"Presence matrix: could not query peer '%s' (%s)", peerIdStr, e.getMessage());
unreachable.add(peerIdStr);
continue;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the state machine reconciliation, catching Exception here swallows InterruptedException without restoring the thread's interrupted status. This can prevent proper thread pool management and clean shutdowns.

        } catch (final InterruptedException e) {
          Thread.currentThread().interrupt();
          LogManager.instance().log(this, Level.WARNING,
              "Presence matrix query interrupted for peer '%s'", peerIdStr);
          unreachable.add(peerIdStr);
          continue;
        } catch (final Exception e) {
          LogManager.instance().log(this, Level.WARNING,
              "Presence matrix: could not query peer '%s' (%s)", peerIdStr, e.getMessage());
          unreachable.add(peerIdStr);
          continue;
        }

@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

PR Review: HA auto-acquire of databases a joining node has never seen (#4727)

Reviewed against CLAUDE.md conventions. Overall this is a well-structured, carefully-documented change that solves a real correctness gap (the #4522 StatefulSet scale-up case). The crash-safe staging/atomic-rename design is the right approach, the additive-only guard is a sound decision, and the recovery-path test coverage is good. A few points worth addressing before merge.

Correctness / potential bugs

  1. Stale LEADER_MISSING produces a false-positive cluster alert. acquireStatuses entries are never removed. If a database is flagged LEADER_MISSING and is later legitimately dropped locally (DROP_DATABASE_ENTRY replay), or the operator transfers leadership and resyncs, the stale LEADER_MISSING entry survives, so getDatabasesWithAcquireState(LEADER_MISSING) keeps the leader-missing-databases alert raised for a database that no longer exists locally. Consider pruning acquireStatuses for names no longer in server.getDatabaseNames() at the start of each reconcile (and/or removing the entry on drop). The per-database status in GetClusterHandler is shielded because it only iterates existing DBs, but the alert path is not.

  2. InterruptedException swallowed without restoring the interrupt flag. In reconcileDatabasesFromLeader, LeaderDatabaseQuery.fetch(...) declares throws IOException, InterruptedException, but catch (final Exception e) degrades to refreshExistingDatabases and loses the interrupt. If this runs on a pool thread being shut down, the interrupt is silently dropped. Suggest catching InterruptedException separately (or re-asserting Thread.currentThread().interrupt()).

  3. First-DB failure aborts the whole reconcile pass. The install loop rethrows IOException on the first failing database, so remaining leader databases in that pass are never attempted and the additive-only guard loop never runs. This is intentional (Ratis re-triggers, installs are idempotent) and documented - but worth confirming it is acceptable that a single flaky download blocks LEADER_MISSING visibility for the other databases until the next pass.

Minor / style

  1. Hardcoded "root" literal. LeaderDatabaseQuery uses the string "root" for X-ArcadeDB-Forwarded-User. RaftHAServer already exposes a FORWARDED_ROOT_USER constant for exactly this. BootstrapElection also inlines "root", so this is consistent with one existing caller, but reusing the constant would be cleaner per the "reuse existing components" guideline.

  2. http:// scheme hardcoded in LeaderDatabaseQuery.fetch. This matches the existing BootstrapElection.queryPeer pattern (the bootstrap-state RPC is always plain HTTP cluster-internal), so it is not a regression - just flagging that an HTTPS-only cluster's peer-to-peer RPCs already assume plaintext here.

  3. Presence matrix fan-out is sequential and blocking. buildPresenceMatrix queries each peer one at a time with LeaderDatabaseQuery.fetch; with N peers and slow/unreachable nodes the worst case is N x timeout. It is gated behind an explicit button (not the auto-poll), so acceptable, but BootstrapElection already does this fan-out async via sendAsync - reusing that style would bound the latency.

Performance

  • The reconcile and presence calls run on the common-pool notifyInstallSnapshotFromLeader path (already a documented NOTE (concurrency) caller). The added blocking bootstrap-state HTTP call is consistent with the existing snapshot-download blocking there, so no new pool-discipline violation. Good.
  • Static reused HttpClient with a bounded connect timeout - good, no per-call client allocation.

Security

  • Reuses the established cluster-token + forwarded-user auth pattern; no new surface. The ?presence=true fan-out and the verify/stepdown/leader endpoints are all leader-gated, and the Studio JS HTML-escapes peer/db names (escapeHtml) before injecting into the DOM - no obvious XSS.

Test coverage

  • SnapshotAcquireStagingRecoveryTest covers the recovery invariants well (reserved-prefix guard, interrupted-staging cleanup, coexisting pending-swap + staging). SnapshotAcquireNewDatabaseIT exercises the happy path end-to-end deterministically; the rationale for skipping full-cluster-restart ITs is sound.
  • Gaps worth considering: (a) the additive-only guard (local DB absent on leader stays put and is flagged LEADER_MISSING); (b) the concurrent-create race fallback in acquireNewDatabase (DB materializes mid-download, falls back to install); (c) the autoAcquire=false legacy path. These are the genuinely new branches and aren't currently asserted.

Nice work overall - the documentation on the staging/rename invariant is exemplary and made review straightforward.

Automated review; please verify suggestions against intent.

@codecov

codecov Bot commented Jun 25, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 30.14085% with 248 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.23%. Comparing base (dc19716) to head (b87b7ae).
⚠️ Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
...com/arcadedb/server/ha/raft/GetClusterHandler.java 0.00% 74 Missing ⚠️
...om/arcadedb/server/ha/raft/ArcadeStateMachine.java 48.50% 66 Missing and 3 partials ⚠️
...com/arcadedb/server/ha/raft/SnapshotInstaller.java 5.79% 65 Missing ⚠️
...m/arcadedb/server/ha/raft/LeaderDatabaseQuery.java 23.80% 32 Missing ⚠️
...ava/com/arcadedb/server/ha/raft/ClusterAlerts.java 76.47% 8 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4739      +/-   ##
============================================
- Coverage     65.31%   65.23%   -0.09%     
- Complexity      567      622      +55     
============================================
  Files          1673     1674       +1     
  Lines        131833   132395     +562     
  Branches      28232    28320      +88     
============================================
+ Hits          86103    86362     +259     
- Misses        33833    34124     +291     
- Partials      11897    11909      +12     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

- SnapshotInstaller.acquireNewDatabase now VALIDATES the downloaded snapshot
  while it is still under the reserved .acquire-<name> staging dir (open
  read-only + close) BEFORE the atomic publish. A corrupt snapshot is discarded
  in staging and never reaches databases/<name>/, closing the window where a
  failed post-publish open could leave an undeletable corrupt directory (Windows
  file-lock) that the startup scan would then try to open. Replaces the previous
  cleanup that re-opened the DB to close it (which re-threw and never closed).
- Restore the interrupt flag on InterruptedException in
  reconcileDatabasesFromLeader and GetClusterHandler.buildPresenceMatrix instead
  of swallowing it via catch(Exception).
- Prune stale acquireStatuses entries (databases no longer local nor on the
  leader) at reconcile start so the leader-missing alert clears once a database
  is dropped/redistributed.
- Reuse RaftHAServer.FORWARDED_ROOT_USER (now package-private) in
  LeaderDatabaseQuery instead of the inlined "root" literal.

Tests: ClusterAlertsTest covers the leader-missing alert builder (raised with
names / absent when empty); SnapshotAcquireNewDatabaseIT adds the
concurrent-create fallback (acquire on an existing DB refreshes in place).
@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review: PR #4739 — HA auto-acquire of databases a joining node has never seen

Thorough, well-documented change. The crash-safety design (reserved .acquire-<name>/ staging, validate-before-publish, single atomic rename, with startup cleanup of leftover staging dirs) is exactly the right shape for a never-seen DB that has no previous copy to roll back to, and the "additive-only, never drop a LEADER_MISSING db" invariant is the correct conservative choice. Backward compatibility is preserved (the ClusterAlerts.scan(server) overload, the HA_AUTO_ACQUIRE_DATABASES gate, the legacy refreshExistingDatabases fallback). Nice work on the test coverage too: the staging-recovery unit test and the direct acquireNewDatabase IT exercise the genuinely new code deterministically.

A few items, in priority order.

1. LeaderDatabaseQuery hardcodes http:// and ignores SSL — feature silently disabled on SSL clusters (main concern)

LeaderDatabaseQuery.fetch() builds "http://" + httpAddr + ... and uses a static HttpClient with no SSL context. This is inconsistent with every other inter-node RPC in this module:

On an SSL-enabled cluster where the plain-HTTP listener is disabled, fetch() throws for the leader, so reconcileDatabasesFromLeader quietly degrades to refreshExistingDatabases — i.e. auto-acquire never happens and the operator gets only a WARNING log, defeating the whole feature for exactly the StatefulSet/empty-node case it targets. The presence matrix fan-out in GetClusterHandler.buildPresenceMatrix has the same problem (every peer reported unreachable). Suggest mirroring the existing pattern: take both the http and https peer addresses, pick scheme by NETWORK_USE_SSL, and apply SnapshotInstaller.buildSSLContext(server). reconcileDatabasesFromLeader already has leaderHttpsAddr in hand to pass through.

2. acquireNewDatabase does not take the INSTALLS_IN_FLIGHT guard that install() has

install() registers the resolved path in INSTALLS_IN_FLIGHT and logs loudly on overlap, since the lifecycle assumes installs for a given DB never overlap. acquireNewDatabase shares the same hazard (two concurrent acquires would download into the same .acquire-<name> staging dir and the same dbPath) but has no such guard. Ratis serializes InstallSnapshot per follower so this is unlikely, but adding the same guard (or a comment explaining why it is unnecessary here) would keep the two paths consistent and any future coordination bug diagnosable.

3. acquireStatus semantics blur for plain refreshes

reconcileDatabasesFromLeader sets ACQUIRING then ACQUIRED for every leader DB on every install-snapshot, including DBs that already existed locally and were merely refreshed. After the first reconcile, GET /api/v1/cluster reports acquireStatus: ACQUIRED for long-standing databases that were never actually "acquired". Not a bug, but the field name implies a one-time acquisition; consider only emitting ACQUIRING/ACQUIRED on the genuinely-new (isNew) path so operators can distinguish a true pull from a routine snapshot refresh.

4. Minor

  • Doc vs behavior: the HA_AUTO_ACQUIRE_DATABASES Javadoc says "re-read on every boot"; it is actually read live via getValueAsBoolean on every reconcile (fine/better) — worth aligning the wording.
  • Presence fan-out cost: buildPresenceMatrix queries peers sequentially on the Undertow worker thread, so worst-case latency is peers x HA_BOOTSTRAP_TIMEOUT_MS. It is opt-in (?presence=true) and leader-only, so acceptable, but a note (or a parallel fan-out) would help larger clusters. Also bootstrap-state calls server.getDatabase(...), which can open a closed DB on the queried peer — a slight tension with the "a status poll must never trigger a database open" principle stated for the cheap poll path.
  • Studio: verifyDatabase builds the onclick via JSON.stringify(name).replace(/"/g,"&quot;") — works for normal names, but routing through a data- attribute + event handler would be more robust if DB names ever contain quotes.

Tests

Good coverage of the new code: staging recovery (interrupted acquire, coexisting pending-swap, completed-DB-untouched), the end-to-end acquire IT, and the concurrent-create refresh-fallback branch. The paths not covered are the SSL behavior of item 1 and the leader-unreachable degradation branch — a unit test asserting the fallback fires when LeaderDatabaseQuery.fetch throws would lock in the "never regress the install" guarantee.

Overall a solid fix for #4727. Item 1 is the only one I would consider blocking, since it determines whether the feature actually works on the SSL deployments most likely to be running a StatefulSet.

Reviewed against CLAUDE.md conventions (HA concurrency rules, JSON helpers, test style).

…s semantics

Addresses code-review findings:

1. (blocking) LeaderDatabaseQuery now honors arcadedb.ssl.enabled instead of
   hardcoding http://: it prefers the peer's HTTPS endpoint and trusts the
   cluster keystore via SnapshotInstaller.buildSSLContext, falling back to plain
   HTTP with a one-time warning only when no HTTPS address is known - mirroring
   SnapshotInstaller.downloadWithRetry (#4470). Without this, auto-acquire and
   the presence matrix were silently disabled on SSL-only clusters (the
   plain-HTTP listener is typically off there), defeating the feature for the
   StatefulSet/empty-node case it targets. Both callers updated to pass the peer
   HTTPS address + server (reconcileDatabasesFromLeader, buildPresenceMatrix).
2. acquireNewDatabase now takes the same INSTALLS_IN_FLIGHT overlap guard install()
   uses, releasing it before delegating to install() on the concurrent-create
   fallback so no spurious overlap is logged.
3. acquireStatus (ACQUIRING/ACQUIRED/FAILED) is recorded only for genuinely-new
   acquisitions; a routine snapshot refresh of an already-present database is no
   longer reported as ACQUIRED.
4. Doc/wording: HA_AUTO_ACQUIRE_DATABASES is read live per reconcile (not per
   boot); note the presence fan-out's sequential cost and remote-open tension.

Tests: LeaderDatabaseQueryTest covers scheme selection (http when SSL off, https
when SSL on + https addr known, http fallback when SSL on but no https addr, null
when no usable address).
@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review: HA auto-acquire of unseen databases (#4727)

Reviewed the full diff. This is a well-structured, carefully-reasoned change. The crash-safety design in SnapshotInstaller.acquireNewDatabase (reserved .acquire-<name> staging -> validate-while-staged -> single atomic rename -> register, with startup recovery for orphaned staging dirs) is the highlight: it correctly closes the "half-written databases/<name>/ gets opened by the boot scan" hazard, and validating before publish is the right call since a never-seen DB has no copy to roll back to. The SSL-aware endpoint selection, the additive-only guard, and the inline rationale comments are all excellent.

A few things worth addressing.

1. LEADER_MISSING status/alert is sticky and never clears (real, minor)

In reconcileDatabasesFromLeader, once a DB is flagged LEADER_MISSING, I don't see a path that ever clears it short of a process restart:

  • The prune at the top only removes entries where !server.existsDatabase(name) && !leaderDbNames.contains(name) - a LEADER_MISSING DB still exists locally, so it survives.
  • When the leader later regains the DB (resync / leadership transfer), the reconcile loop sees isNew == false, takes the refresh branch, and deliberately does not touch acquireStatuses.
  • The additive guard at the bottom only sets the flag; it never unsets it.

Result: the leader-missing-databases cluster alert stays raised (and is shown even on a node that has since become the leader and holds the DB), which is misleading for the exact recovery flow this PR documents. Suggest clearing a stale LEADER_MISSING/FAILED entry when an existing DB is found on the leader, e.g. in the refresh branch acquireStatuses.remove(dbName) (or transition it to a terminal "OK" state).

2. Test coverage gap on the reconcile logic itself

The new tests cover endpoint selection, the alert builder, staging crash-recovery, and the acquireNewDatabase happy/fallback paths - good. But reconcileDatabasesFromLeader (the union of leader+local, the LEADER_MISSING flagging, and the leader-unreachable degrade-to-refresh fallback) is private and untested. The sticky-status bug in #1 is exactly what a small reconcile-level test would have caught. Consider extracting the set-reconciliation into a package-private pure helper (leaderDbs + localDbs -> actions/statuses) and unit-testing the transitions.

3. Performance / load notes (acceptable, just flagging)

  • reconcileDatabasesFromLeader calls LeaderDatabaseQuery.fetch against the leader on every InstallSnapshot. As the javadoc notes, the bootstrap-state handler may open closed databases on the queried node to fingerprint them, so a follower catching up can force the leader to open all its DBs. Infrequent enough to be fine, but worth being aware of on clusters with many large DBs.
  • The presence fan-out is sequential on the Undertow worker thread, worst-case peers x HA_BOOTSTRAP_TIMEOUT_MS (default 120s each). It is opt-in/leader-only so OK, but a slow/hung peer can tie up a worker for a long time; the code comment already acknowledges a parallel fan-out as the fix if needed.

Minor

  • validateSnapshotOpens: new DatabaseFactory(...) is not closed (only the Database is). DatabaseFactory.close() just clears callbacks so the impact is negligible, but for symmetry you could try (final DatabaseFactory f = ...; final Database db = f.open(...)).
  • The endpoints the new Studio buttons call (/cluster/leader, /cluster/stepdown, /cluster/verify/{db}) all exist - confirmed. Studio output is consistently run through escapeHtml, good.

Verdict

Solid, defensive implementation that fixes the root cause cleanly. The only functional issue is the sticky LEADER_MISSING alert (#1); everything else is polish. Nice work.

Reviewed with Claude Code.

… plan

Addresses the second review round:

1. Sticky LEADER_MISSING status/alert. Once flagged, a LEADER_MISSING database's
   status never cleared short of a restart, so the leader-missing cluster alert
   lingered through the very resync / leadership-transfer recovery flow this PR
   documents. Now:
   - the refresh branch clears the acquisition status (acquireStatuses.remove)
     when the leader regains a database this node already has, and
   - becoming LEADER clears any LEADER_MISSING entries (meaningless on the leader,
     which never runs the follower reconcile path).
2. Extracted the set-reconciliation into a pure, package-private
   ArcadeStateMachine.classifyReconcile(leaderDbs, localDbs) -> ReconcilePlan
   {toAcquire, toRefresh, leaderMissing} and drove reconcileDatabasesFromLeader
   from it, so the union/difference + LEADER_MISSING logic is unit-testable
   (ReconcilePlanTest) - the gap that let the sticky-status bug through.
3. Minor: validateSnapshotOpens now closes the DatabaseFactory too (try-with-
   resources on both factory and database).

Tests: ReconcilePlanTest covers empty-node/full-replica/mixed/orphan splits and
sorted-output determinism.
@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Code Review: HA auto-acquire of never-seen databases (#4727)

Reviewed the full diff plus the surrounding HA code. This is a well-engineered fix: the crash-safe staging/atomic-rename approach is the right pattern, the set-reconciliation logic is cleanly factored into a pure, unit-tested method, and the SSL-aware transport, additive-only guard, and graceful degradation on an unreachable leader all show careful thought. The Javadoc is genuinely useful. Tests cover the pure logic, crash recovery, and an end-to-end acquire. Nice work overall - the comments below are mostly refinements, with one robustness point worth discussing.

Potential issue: one failing acquire starves refresh of healthy databases

In `reconcileDatabasesFromLeader`, the `plan.toAcquire()` loop runs before `plan.toRefresh()`, and `acquireNewDatabase` rethrows `IOException` (e.g. from `validateSnapshotOpens`), which aborts the whole method:

```java
for (final String dbName : plan.toAcquire()) {
...
SnapshotInstaller.acquireNewDatabase(...); // throws -> aborts before any refresh
}
for (final String dbName : plan.toRefresh()) { ... } // never reached on failure
```

Because Ratis re-triggers the same `InstallSnapshot`, a leader database whose snapshot persistently fails validation (corrupt/incompatible) would abort at the same point every pass, so the already-replicated databases on this follower never get refreshed - a single bad new DB indefinitely starves the healthy ones. Consider collecting per-database failures and continuing (still failing the overall install at the end so Ratis retries), or at least running the refresh loop before propagating an acquire failure. The `FAILED` status you already record would then reflect a partial result rather than an all-or-nothing abort.

Minor: error-state leader databases are treated as present

`PostBootstrapStateHandler` includes databases it could not open with `lastTxId=-1` / empty fingerprint, and `LeaderDatabaseQuery.parse` adds every entry to the set unconditionally. So a follower may try to `acquireNewDatabase` a database the leader itself cannot open. The download would likely fail validation and surface as `FAILED` (and feed the starvation above). Filtering out entries the peer reported an `error` / `lastTxId == -1` for might be cleaner, since such an entry is not a usable source.

Minor: validateSnapshotOpens opens under the staging name

The snapshot is opened read-only while still named `.acquire-`, then atomically renamed to ``. This relies on the database name being derived from the directory at open time (so no `.acquire-` name is persisted) and on the read-only open leaving no lock/WAL artifacts behind in staging that would then be published into the final directory. Both appear to hold for ArcadeDB, but it's worth a one-line confirmation since a stale artifact carried across the rename would be hard to diagnose.

Minor: unused supplier flexibility

`acquireNewDatabase` takes `Supplier` for the leader addresses, but every caller passes a captured constant (`() -> leaderHttpAddr`). If the intent was to re-resolve the current leader across retries, the constant lambda defeats it; if not, plain `String` params (matching `install`) would be simpler. Not a bug - just an inconsistency.

Performance / scope notes (no action required)

  • The `?presence=true` fan-out is sequential on the Undertow worker thread (worst case `peers x HA_BOOTSTRAP_TIMEOUT_MS`) and can open a closed DB on each remote peer to fingerprint it. Both are clearly documented and gated behind an opt-in leader-only path, so this is acceptable; a parallel fan-out is the obvious future improvement for large clusters.
  • The HTTPS path builds a fresh `HttpClient` per call (correctly closed via try-with-resources on Java 21). Fine given how rarely these paths run.

Studio

XSS handling looks solid - `escapeHtml` is applied to all interpolated peer/database names, and `verifyDatabase`'s onclick arg is JSON-encoded and quote-escaped. The graceful degradation when cards are absent (older `cluster.html`) is a nice touch.

Tests

Good coverage of the pure reconcile logic, endpoint/scheme selection, crash recovery (including the combined acquire-staging + pending-swap case), and an end-to-end acquire with the concurrent-create fallback branch. Two gaps worth considering: (1) a test asserting reconcile behavior when one acquire fails but other databases should still be refreshed (the starvation point above), and (2) the `LEADER_MISSING` -> alert path is unit-tested at the `ClusterAlerts` level but not the `ArcadeStateMachine` transition that populates it.

Conventions

Matches the repo style well (final params, no fully-qualified names, `assertThat(...)` AssertJ syntax, Apache headers, no Claude attribution in sources). The `SnapshotAcquireNewDatabaseIT` is correctly `IT`-suffixed so it stays out of the normal `mvn test` run.

Note: I could not run `mvn` in this sandbox, so I reviewed statically and relied on the stated "355 tests, 0 failures" for the suite result.

🤖 Generated with Claude Code

…ble leader DBs

Addresses the third review round:

1. A failing acquire no longer starves refresh of healthy databases. The reconcile
   execution is extracted into a pure, operation-injected
   ArcadeStateMachine.executeReconcilePlan(plan, acquireOp, refreshOp) that runs
   every database with per-database failure isolation: a persistently corrupt new
   DB (fails validation each pass) no longer aborts before the already-replicated
   databases are refreshed. Per-DB failures are collected, statuses applied from
   the outcome, and the overall install still fails at the end so Ratis re-triggers
   (installs are idempotent).
2. Filter leader databases the leader reported as unreadable (lastTxId < 0): such a
   database is not a usable snapshot source, so it is skipped (and logged) rather
   than repeatedly attempted and failed.
3. Minor: documented that the read-only validate-in-staging open derives the name
   from the directory and leaves no artifact carried across the atomic rename
   (covered by SnapshotAcquireNewDatabaseIT); noted the address suppliers are
   intentionally constant for a fixed InstallSnapshot leader.

Tests: ReconcilePlanTest now also asserts a failing acquire still lets healthy
refreshes run (no starvation) and the all-succeed path.
@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review: HA auto-acquire of never-seen databases (#4727)

Solid, well-engineered PR. The crash-safety design (stage under a reserved .acquire-<name>/, validate while still staged, publish with a single atomic rename) is the right shape, the set logic and execution are cleanly factored into pure/package-private helpers with good unit tests, and the docstrings explain the why unusually well. The additive-only guard (never drop a DB the leader lacks) plus the LEADER_MISSING alert is a thoughtful answer to the #4522 "empty node gets elected leader" hazard. Nice work.

A few things worth a look before merge:

Potential issues

1. A persistently-failing acquire hot-loops the whole reconcile (medium).
executeReconcilePlan isolates per-DB failures, but reconcileDatabasesFromLeader still throws at the end if any DB failed, so Ratis re-triggers the entire InstallSnapshot. On the next pass the bad DB is still in toAcquire (still missing locally) and every healthy DB is re-downloaded and refreshed again. A single DB whose snapshot opens on the leader but fails validateSnapshotOpens on this node (corruption that passes the ZIP ratio/path checks but not the open) therefore drives an unbounded re-download of all databases on this follower. The leader's HA_SNAPSHOT_MAX_CONCURRENT cap limits blast radius, but it's still churn. Worth considering a backoff or a "stop retrying this DB after N consecutive validation failures" so one bad DB doesn't keep re-pulling the healthy ones. The lastTxId < 0 skip handles unreadable-on-leader, but not fails-to-open-here.

2. validateSnapshotOpens opens without the server's security/config (low).
It does new DatabaseFactory(stagingPath).open(READ_ONLY) with neither factory.setSecurity(server.getSecurity()) nor the server's ContextConfiguration, whereas the real registration in ArcadeDBServer.getDatabase passes setSecurity(...) and reads SERVER_DEFAULT_DATABASE_MODE. For a pure smoke-test of component/schema loading this is probably fine (per-DB config lives in the directory), but it means the validation open can succeed under different settings than the eventual live open. If a deployment ever relies on server-level config to open a DB, validation could pass and the post-publish getDatabase() then fail (handled, but only after the atomic rename). Worth a one-line confirmation that this can't diverge, or just thread the server's security/config through for parity.

3. No assertion that an acquired DB keeps receiving replication (low, test gap).
SnapshotAcquireNewDatabaseIT proves the DB is acquired with matching record counts, but not that a subsequent write on the leader propagates to the newly-registered follower copy. Registration goes through the same server.getDatabase() path the bootstrap flow uses, so it's very likely wired correctly - but a "leader writes one more vertex after acquire -> follower sees it" assertion would close the loop on the part that actually matters operationally (ongoing replication, not just the one-time pull).

Minor / nits

  • INSTALLS_IN_FLIGHT.add() == false proceeds anyway (acquireNewDatabase), only logging a WARNING. It mirrors install() and relies on "Ratis serializes InstallSnapshot per follower," but if that invariant ever breaks, two acquisitions share one staging dir and race on the rename. Since the warning already declares it a coordination bug, failing fast (return/throw) would be safer than double-staging.
  • Presence fan-out is sequential on the Undertow worker thread (worst case peers x HA_BOOTSTRAP_TIMEOUT_MS). Documented and opt-in, so acceptable, but a hung/slow peer holds the request thread for the full budget; a parallel fan-out or a tighter aggregate cap would harden it for larger clusters.
  • Studio: escapeHtml/globalPrompt/etc. all exist and the new tables consistently escape node/db names - good. The backend routes (/cluster/leader, /stepdown, /verify/) are all registered in RaftHAPlugin, so the new buttons aren't dangling.

Security

No new concerns. LeaderDatabaseQuery reuses the existing cluster-token + forwarded-root-user auth and the SSL/keystore path from SnapshotInstaller, prefers HTTPS when SSL is enabled (with a unit-tested scheme selector and a one-time plain-HTTP fallback warning), and the presence endpoint is leader-only + gated behind ?presence=true. Studio output is HTML-escaped throughout.

Tests

Good coverage of the pure logic (ReconcilePlanTest, LeaderDatabaseQueryTest, ClusterAlertsTest) and of crash recovery (SnapshotAcquireStagingRecoveryTest, incl. the combined staging+pending-swap case). The note explaining why a full-cluster restart IT wouldn't actually exercise this path (in-process Ratis catches up via log replay, not InstallSnapshot) is a good catch and justifies the direct acquireNewDatabase IT approach.

Overall: in good shape - issue #1 is the one I'd most want addressed (or explicitly accepted) before merge.

Reviewed with Claude Code.

…fast guard

Addresses the fourth review round:

1. A persistently-failing acquire no longer hot-loops the whole reconcile. A
   database that opens on the leader but fails validateSnapshotOpens here would,
   under the always-throw policy, make Ratis re-trigger InstallSnapshot every pass
   and re-download every healthy database on this follower. Now consecutive
   acquire/refresh failures are counted per database; after ACQUIRE_GIVE_UP_AFTER
   (3) in a row that database stops forcing the retry (left FAILED, surfaced in
   cluster status, retried on the next natural InstallSnapshot). The counter
   resets on any success and is pruned with the status map.
2. validateSnapshotOpens now opens with the server's security + auto-transaction
   so the read-only validation matches the settings of the eventual live open in
   ArcadeDBServer.getDatabase rather than diverging under defaults.
3. acquireNewDatabase now fails fast (throws) if an acquisition for the same
   database is already in flight, instead of proceeding and risking two
   acquisitions racing on the staging dir / atomic rename.

Tests: SnapshotAcquireNewDatabaseIT now also asserts a write made on the leader
after the acquisition replicates to the newly-registered follower copy (ongoing
replication, not just the one-time pull).
@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review of #4739 - HA auto-acquire of never-seen databases

Reviewed the full diff against the repo conventions. This is a well-engineered fix with a clear root-cause analysis, strong crash-safety design, and good separation of pure/testable logic. Summary below.

Strengths

  • Crash-safe acquisition is the highlight. Staging under a reserved databases/.acquire-<name>/ dir (skipped by the boot scan), validate-before-publish, single atomic rename, and startup cleanup in recoverPendingSnapshotSwaps is exactly the right shape. The reasoning for validating before the rename (no rollback copy exists for a never-seen DB; avoids an undeletable corrupt dir on Windows) is sound.
  • Additive-only guard (never drop a DB the leader lacks, flag LEADER_MISSING instead) correctly preserves the distinction between a real DROP_DATABASE_ENTRY and "leader never had it". Good safety call for the How to replicate databases when new cluster nodes are added in Kubernetes ? #4522 scenario.
  • Failure isolation + give-up counter (executeReconcilePlan, ACQUIRE_GIVE_UP_AFTER) prevents one persistently-bad DB from starving healthy refreshes or pinning Ratis in a re-install loop.
  • Pure, package-private, unit-tested logic (classifyReconcile, executeReconcilePlan, chooseEndpoint) - matches the codebase's preference for testable cores.
  • SSL-aware transport in LeaderDatabaseQuery, reusing the cluster-token + forwarded-root-user auth like every other peer RPC. The HttpClient per-HTTPS-call is closed (AutoCloseable on 21) so the selector thread is not leaked. Default-on with a documented mixed-cluster safety story.

Concerns / suggestions

  1. Integration coverage gap for the orchestration glue (main one). The PR (fairly) notes in-process Ratis catches up via log replay, so the IT calls acquireNewDatabase directly. But that means reconcileDatabasesFromLeader itself - the status-map transitions, the give-up/failure-count logic, the leader-unreachable fallback to refreshExistingDatabases, and the notifyInstallSnapshotFromLeader wiring - is exercised only by unit tests with injected ops, never end-to-end. Since reconcileDatabasesFromLeader is private, consider extracting the post-executeReconcilePlan status/counter bookkeeping into a package-private pure method (input: ReconcileOutcome + plan, output: status mutations) so the FAILED/ACQUIRED/LEADER_MISSING transitions and the retry-worthwhile decision get direct test coverage without a live cluster.

  2. Presence fan-out can block an Undertow worker for minutes. buildPresenceMatrix is sequential and uses HA_BOOTSTRAP_TIMEOUT_MS (default 120000ms) as the per-peer request timeout. A peer that accepts the TCP connection but hangs costs the full 2 min, x N peers, on the worker thread. The 5s connectTimeout only covers truly unreachable hosts. It is opt-in and leader-only as documented, but a 2-min-per-peer UI button is a sharp edge - consider a dedicated shorter timeout for the presence query (or the parallel fan-out the Javadoc already anticipates).

  3. Persistently-failing acquire keeps re-downloading. After ACQUIRE_GIVE_UP_AFTER, the give-up only stops forcing the Ratis retry; the DB still stays in plan.toAcquire() and re-downloads the bad snapshot on every subsequent natural reconcile. Acceptable (reconciles are infrequent) but worth a comment, or skip re-download while a DB is in long-term FAILED.

  4. Legacy fallback does not filter reserved DBs. refreshExistingDatabases iterates server.getDatabaseNames() unfiltered, while the new auto-acquire path explicitly skips RESERVED_DATABASE_PREFIX. This is faithful to the pre-existing behavior you extracted, so not a regression, but now that the two paths sit side by side the asymmetry is a small trap - consider applying the same reserved-prefix filter in the fallback for consistency.

  5. Minor (Windows): validateSnapshotOpens opens the staging dir read-only then closes before the atomic rename. A leaked file handle from that open would block the rename on Windows. Low risk (read-only close should release handles), just flagging the dependency.

Security

No issues spotted. Peer RPCs reuse the cluster token + forwarded root user; the new Studio endpoints (/cluster/leader, /stepdown, /verify/{db}) are pre-existing authenticated handlers; the presence matrix is gated behind ?presence=true and leader-only; no sensitive data is logged. Studio JS uses escapeHtml for rendered values and JSON.stringify for the onclick name argument.

Style/conventions

Consistent with the codebase: final everywhere, imports (no FQNs), JSONObject/JSONArray with default-arg getters, AssertJ assertThat(...), Apache headers on new files. No System.out, no new dependencies, Claude not credited as author.

Nice work overall - the crash-safety and additive-only design are the right instincts for an HA data path. The main thing I would push on before merge is closing the test gap on the reconcile bookkeeping (point 1).

Reviewed by Claude (Opus 4.8)

…+ fallback parity

Addresses the fifth review round:

1. Extracted the post-executeReconcilePlan bookkeeping (status-map transitions,
   failure-counter updates, and the give-up/retry-worthwhile decision) into a
   package-private ArcadeStateMachine.applyReconcileOutcome(plan, outcome). It has
   no leader/network I/O, so the ACQUIRED/FAILED/LEADER_MISSING transitions, the
   refresh-clears-stale-LEADER_MISSING behavior, and the give-up-after-N decision
   are now directly unit-tested (ReconcilePlanTest) - closing the gap where this
   orchestration glue was only exercised via injected ops.
2. Presence fan-out now uses a short 5s per-peer timeout instead of
   HA_BOOTSTRAP_TIMEOUT_MS (120s), so a peer that accepts the connection but hangs
   cannot tie up an Undertow worker for minutes x N peers.
3. refreshExistingDatabases (the legacy/fallback path) now skips reserved
   databases (RESERVED_DATABASE_PREFIX) for symmetry with the auto-acquire path.
4. Documented that a long-term-FAILED database is still attempted on the next
   natural InstallSnapshot (which re-downloads everything anyway) so it can recover
   if the leader's copy is fixed; it just no longer hot-loops.

Tests: ReconcilePlanTest adds direct coverage of applyReconcileOutcome - status
transitions, stale-LEADER_MISSING clearing on refresh, and the give-up threshold
(retry forced until N consecutive failures, then suppressed; reset on success).
@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Code Review - PR #4739: HA auto-acquire of databases a joining node has never seen

Reviewed the full diff against CLAUDE.md conventions. This is a well-engineered change that closes a real correctness gap (#4727 / #4522): an empty node joining a Raft cluster now reconciles against the leader and crash-safely pulls databases it has never seen. Overall the design, defensiveness, and test discipline are strong. Notes below.

What's done well

  • Crash-safe acquisition. Staging under a reserved databases/.acquire-<name>/ (skipped by the boot scan) plus a single ATOMIC_MOVE publish is exactly the right pattern - no half-written databases/<name>/ can ever be opened by loadDatabases. The recoverPendingSnapshotSwaps cleanup of leftover staging dirs closes the loop, and SnapshotAcquireStagingRecoveryTest proves it.
  • Validate-before-publish. Opening the snapshot READ_ONLY while still under the reserved name, and discarding on failure, is the safe choice for a never-seen DB with no rollback copy. I confirmed SnapshotHttpHandler zips only specific files (config/schema/component/sealed) and not database.lck, so the READ_ONLY validation open won't trip the "needs recovery" guard in checkForRecovery. Good.
  • Failure isolation + give-up. Per-database isolation so one bad snapshot doesn't starve healthy refreshes, plus the ACQUIRE_GIVE_UP_AFTER counter to stop a persistently-bad DB from hot-looping Ratis InstallSnapshot, is a thoughtful operational detail.
  • Additive-only guard. Never dropping a locally-present DB the leader lacks (distinguishing it from a real DROP_DATABASE_ENTRY) keeps mixed/rolling clusters safe, and the leader-missing-databases alert surfaces it.
  • Testability. Extracting classifyReconcile, executeReconcilePlan, applyReconcileOutcome, and chooseEndpoint as pure/package-private functions makes the set logic, status transitions, and SSL scheme selection unit-testable without a cluster. Nice coverage (ReconcilePlanTest, LeaderDatabaseQueryTest, ClusterAlertsTest) plus a real end-to-end SnapshotAcquireNewDatabaseIT driving the actual HTTP download/publish/register path.
  • SSL-aware transport in LeaderDatabaseQuery (prefer HTTPS when SSL is on, one-time fallback warning) keeps the feature working on SSL-only StatefulSet deployments - the exact target environment.
  • Config default true with a clear, accurate Javadoc; *IT.java correctly lands in the failsafe (integration) phase, not the fast unit path.

Suggestions / minor issues

  1. executeReconcilePlan only isolates IOException. DbInstallOp.run is declared throws IOException, so a RuntimeException from one database's acquire/refresh (e.g. an NPE, or an unchecked failure escaping server.getDatabase) would propagate out of the loop and abort the entire reconcile - defeating the no-starvation goal that is the whole point of this method. acquireNewDatabase/install do wrap most failures into IOException, so this is robustness rather than a known crash, but consider catching Exception (or at least RuntimeException) per database and recording it as a failure so the isolation guarantee is complete.

  2. Presence fan-out can open closed DBs on remote peers. The Javadoc honestly calls this out: buildPresenceMatrix -> LeaderDatabaseQuery.fetch -> bootstrap-state may open a closed database on the queried peer to fingerprint it. Since this path is opt-in/leader-only it's acceptable, but it's a side effect an operator clicking "Load presence matrix" wouldn't expect (a status view triggering DB loads cluster-wide). Worth a one-line hint in the Studio UI, or a no-open variant of the query if presence only needs names.

  3. New HttpClient per HTTPS query. fetch builds and closes a fresh HttpClient for each HTTPS call. In the presence fan-out that's one client (and selector thread) per peer per request. Fine for the infrequent reconcile and opt-in presence path, but if this ever moves onto a hot path, consider caching an SSL-configured client.

  4. Residual limitation (not introduced here): auto-acquire only runs on the follower InstallSnapshot path, so a brand-new empty node that somehow becomes leader without first receiving a snapshot never acquires. In practice Raft's up-to-date-log election rule prevents an empty-log node from winning over nodes with data, and the new alert + leadership-transfer flow mitigate the How to replicate databases when new cluster nodes are added in Kubernetes ? #4522 edge. Just confirming this is understood as the documented boundary.

  5. Cosmetic: the retry IOException message concatenates two maps - outcome.acquireFailures() + outcome.refreshFailures() - rendering as {a=..}{b=..}. A small format string would read better in logs.

  6. Studio: name embedding via JSON.stringify(name).replace(/"/g, "&quot;") into the onclick attribute is reasonable given DB-name constraints; escapeHtml is used consistently elsewhere. No action needed unless DB names can contain </'.

Verdict

Solid, defensive, and well-tested. Item (1) is the only one I'd suggest addressing before merge (small change, completes the failure-isolation guarantee); the rest are minor/optional.

Automated review by Claude. Treat as advisory.

Addresses the sixth review round:

1. executeReconcilePlan now isolates Exception (not just IOException) per database.
   An unchecked failure escaping one database's acquire/refresh (e.g. an NPE) would
   otherwise propagate and abort the whole reconcile, defeating the no-starvation
   guarantee. Each is now recorded as a failure and the others proceed.
2. Cleaner retry IOException message (acquire=.. , refresh=..) instead of two
   concatenated maps.
3. Documented the boundary that auto-acquire runs only on the follower
   InstallSnapshot path (Raft's up-to-date-log election rule + the LEADER_MISSING
   alert / leadership-transfer flow keep it unreachable in normal operation).
4. Studio: a hint under "Load presence matrix" that it queries each node and a node
   may briefly open a closed database to report it - run on demand.
5. Noted HttpClient-per-HTTPS-call is fine for these infrequent paths; cache an
   SSL client if it ever moves onto a hot path.

Tests: ReconcilePlanTest adds a runtime-exception isolation case (a RuntimeException
from one acquire still lets the healthy databases refresh).
@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review: HA auto-acquire of never-seen databases (#4727)

Thanks for this - it's a well-structured fix for a real gap (empty StatefulSet nodes silently missing runtime-created databases). The crash-safety design and the deliberate extraction of pure, unit-testable logic stand out. Detailed feedback below.

Strengths

  • Crash-safe acquisition is the highlight: reserved .acquire-<name> staging dir (skipped by the boot scan) + validate-before-publish + single atomic rename. The ordering is correct - validating while still under the reserved name means a corrupt download is simply discarded and the DB stays absent (the safe state), and no half-written non-reserved databases/<name>/ can ever be opened by loadDatabases. The Windows-handle rationale for validating pre-rename is a nice touch.
  • Testability via purity: classifyReconcile, executeReconcilePlan, chooseEndpoint, addLeaderMissingAlert, and applyReconcileOutcome are all free of network/Raft I/O and directly unit-tested. The ReconcilePlanTest / LeaderDatabaseQueryTest coverage is solid.
  • Failure isolation (one bad DB never starves the refresh of healthy ones) plus the catch-Exception-not-just-IOException reasoning is correct and tested.
  • Additive-only guard (LEADER_MISSING never drops) keeps a real DROP_DATABASE_ENTRY distinguishable from "leader never had it".
  • SSL-aware transport in LeaderDatabaseQuery (prefers HTTPS, one-time fallback warning) matters precisely for the SSL-only deployments this targets, and it's tested.
  • Config gate (HA_AUTO_ACQUIRE_DATABASES, default true, re-read per boot) is backward-compatible; the legacy refresh-only fallback on leader-unreachable means the install never regresses. Cross-guard via INSTALLS_IN_FLIGHT keyed by resolved path is consistent with install().

Concerns / suggestions

  1. Whole-install retry amplification (performance). When applyReconcileOutcome returns true, reconcileDatabasesFromLeader throws and Ratis re-triggers the entire InstallSnapshot, re-downloading every healthy database again. For one persistently-bad DB this is up to ACQUIRE_GIVE_UP_AFTER (3) full re-syncs of all databases on the follower. On a node holding many/large DBs that's expensive. The give-up counter bounds the loop (good), but a per-database retry/backoff that doesn't drag the healthy DBs through repeated full downloads would scale better. At minimum the cost is worth calling out in the setting's docs.

  2. Give-up leaves a silent availability gap. Once a DB hits the give-up threshold, the install is reported successful (returns firstTermIndexInLog), so Ratis advances and won't re-trigger. The FAILED DB is then only retried on the next natural InstallSnapshot - which Ratis avoids in favor of log replay - so it could stay absent indefinitely if the leader's copy is later fixed. The LEADER_MISSING alert doesn't cover this (different state). Consider also raising a cluster alert for databases stuck FAILED past the give-up threshold.

  3. Validation correctness hinges on the name not being persisted. validateSnapshotOpens opens under .acquire-<name> and you later open under <name>, relying on "ArcadeDB derives the database name from the directory at open time." The end-to-end IT confirms the published DB opens, which is reassuring, but it's a load-bearing assumption - worth a brief inline note pointing at where the name is (not) persisted, for future maintainers.

  4. Publish-time TOCTOU. Between the pre-publish existsDatabase re-check and deleteDirectoryIfExists(dbPath) + atomic rename, a concurrent INSTALL_DATABASE_ENTRY replay that creates the DB through a path not guarded by INSTALLS_IN_FLIGHT would race on dbPath. Almost certainly prevented by Ratis applying the log single-threaded relative to the snapshot-install path, but deleteDirectoryIfExists(dbPath) (which could blow away a concurrently-created live dir) rests entirely on that serialization. A one-line comment asserting the guarantee would make the invariant explicit.

  5. Presence fan-out is sequential on an Undertow worker thread (worst case peers x 5s). Opt-in and leader-only is the right call, and you already note a parallel fan-out would help large clusters. If you parallelize later, please honor the CLAUDE.md concurrency rule (no ForkJoinPool.commonPool() for server-internal work - use a dedicated bounded pool wired into PoolMetrics).

  6. Minor: counter cleanup gaps. acquireFailureCounts is pruned only on the auto-acquire success path; notifyLeaderChanged clears stale LEADER_MISSING statuses but not the parallel failure counters, and neither map is pruned on the non-auto-acquire / leader-unreachable branches. Bounded by database count, just noting for tidiness.

  7. Minor (Studio). Display paths consistently use escapeHtml / JSON.stringify+&quot; escaping - good. The only unescaped concatenations are the globalNotify / globalAlert message strings (e.g. "Verifying '" + databaseName + "'..."); fine if those helpers render text not HTML, but worth a quick confirm since DB names are user-controlled.

Tests

Coverage is strong for the new deterministic surface: reconcile set logic, no-starvation execution, status/give-up bookkeeping, endpoint scheme selection, staging crash recovery (incl. the combined acquire-staging + pending-swap pass), and an end-to-end acquireNewDatabase IT that also asserts post-acquire replication keeps flowing. The rationale for not using a full-cluster-restart IT (in-process Ratis catches up via log replay, not InstallSnapshot) is sound. One gap: no test for the applyReconcileOutcome-throws -> Ratis-retry path at the reconcileDatabasesFromLeader level (only applyReconcileOutcome in isolation), but that's hard to drive in-process, so the unit-level coverage is a reasonable compromise.

Overall a careful, defensively-coded change. The main things I'd push on before merge are items #1/#2 (retry amplification + the silent FAILED gap), since they affect behavior under real failure rather than the happy path.

Reviewed against CLAUDE.md conventions (final usage, brace style, no new deps, concurrency rules).

…riant notes

Addresses the seventh review round:

1. New cluster alert (failed-acquire-databases) for databases left FAILED after the
   acquire give-up. Because a given-up database is only retried on the next natural
   InstallSnapshot (which Ratis avoids via log replay), it could stay absent even
   after the leader's copy is fixed; the alert surfaces it so an operator can force
   a resync. Distinct from the LEADER_MISSING alert.
2. Documented the whole-install retry amplification in HA_AUTO_ACQUIRE_DATABASES: a
   persistently-failing database is retried a bounded number of times, and each
   retry re-downloads the other databases too (capped, so it cannot loop forever).
3. Added a one-line invariant comment at the publish delete+rename noting it rests
   on Ratis applying the log single-threaded relative to the snapshot-install path.
4. Clear acquireFailureCounts on become-leader (parallel to the LEADER_MISSING
   status cleanup); strengthened the validate-in-staging name-derivation note.
5. Presence-matrix Javadoc now records the CLAUDE.md concurrency rule for any future
   parallel fan-out (no common ForkJoinPool; dedicated bounded pool + PoolMetrics).

Confirmed globalNotify renders via textContent (not HTML), so the Studio message
strings with database names are injection-safe; verifyDatabase escapes its inputs.

Tests: ClusterAlertsTest adds failed-acquire alert coverage (raised with names /
absent when empty).
@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

PR Review: HA auto-acquire of never-seen databases (issue 4727)

Thorough, well-documented change. The crash-safety design is the strongest part, and the decomposition into pure, unit-testable units is exactly right for this kind of distributed-systems logic. Findings ordered by severity.

Bug (functional): "Transfer leadership" peer picker never sends the selected peer

In studio/src/main/resources/static/js/studio-cluster.js, transferLeadershipPrompt() builds a <select id="transferPeerId"> and reads the choice from the globalPrompt callback argument:

globalPrompt("Transfer Leadership", html, "Transfer", function(values) {
  var peerId = values["transferPeerId"];
  if (!peerId) return;   // always undefined for a <select>, so this returns early

But globalPrompt only captures input/textarea values, not select (studio-utils.js:102: body.querySelectorAll("input, textarea")). The established convention is to read form fields directly inside the callback via jQuery, e.g. studio-database.js:1628 reads a <select> with $("#inputCreateTypeParents").val(). As written, peerId is always undefined, so transfer-leadership silently does nothing.

Fix: read the select directly, consistent with the rest of Studio: var peerId = $("#transferPeerId").val();. Alternatively broaden globalPrompt to querySelectorAll("input, textarea, select"), but that touches a shared helper - the local fix is safer. Worth a manual test of the new leadership/verify/presence buttons since there is no frontend test coverage.

Minor: stale FAILED status survives on a node that becomes leader

notifyLeaderChanged clears LEADER_MISSING statuses and acquireFailureCounts when a node becomes leader, but leaves FAILED entries in acquireStatuses. Since the leader never runs the follower reconcile path, a FAILED entry (and its failed-acquire-databases alert) can linger on the new leader even though it is no longer meaningful there. Consider clearing FAILED on leader change too, mirroring the LEADER_MISSING cleanup.

Behavior change worth calling out: legacy refresh path now skips reserved DBs

The old loop in notifyInstallSnapshotFromLeader iterated all server.getDatabaseNames() and installed each existing one. The new refreshExistingDatabases adds a RESERVED_DATABASE_PREFIX skip. Almost certainly an improvement (the leader never serves reserved DBs as snapshots), but it is a behavior change on the autoAcquire=false path too - flagging it so it is intentional.

Things I verified that look correct

  • READ_ONLY snapshot validation is safe. I checked whether validateSnapshotOpens (READ_ONLY) could reject a snapshot containing a stale database.lck ("needs recovery but open in read only mode", LocalDatabase.java:2138). It cannot: SnapshotHttpHandler only zips config/schema/component/.ts.sealed files, never database.lck, and a READ_ONLY open creates no lock file (LocalDatabase.java:2157-2158).
  • Crash-safety: staging under reserved .acquire-<name>/ + validate-before-publish + single ATOMIC_MOVE + startup cleanup of leftover staging dirs is clean and correct. No partial non-reserved directory can ever be opened by the boot scan.
  • Failure isolation / no-starvation in executeReconcilePlan (catching Exception, not just IOException) is the right call and is well tested.
  • Give-up cap (ACQUIRE_GIVE_UP_AFTER) correctly stops a persistently-bad database from hot-looping Ratis InstallSnapshot. The documented consequence (a DB can stay absent until the next natural InstallSnapshot, since Ratis prefers log replay) makes the operator alert load-bearing - good that it exists.
  • All referenced endpoints (/cluster/leader, /stepdown, /verify/, /resync/, /bootstrap-state) and RaftHAServer accessors exist; SSL endpoint selection honors HTTPS-only clusters.

Performance / security

  • Presence fan-out is sequential on an Undertow worker (worst case peers x 5s), but it is opt-in (?presence=true) and leader-only, and the code documents the dedicated-pool path if ever parallelized - consistent with the CLAUDE.md concurrency rule. Fine as-is.
  • Per-HTTPS-call HttpClient construction is acceptable for these cold paths and is closed via try-with-resources; the comment already notes caching if it moves onto a hot path.
  • Inter-node auth reuses the cluster token + forwarded-root-user pattern; no new exposure. Studio output uses escapeHtml/encodeURIComponent consistently.

Test coverage

Strong on the pure logic (classifyReconcile, executeReconcilePlan, applyReconcileOutcome, chooseEndpoint, alerts, staging recovery) plus an end-to-end acquireNewDatabase IT. Reasonable-to-defer gaps: the leader-list-fetch-fails to degrade-to-refresh fallback in reconcileDatabasesFromLeader, buildPresenceMatrix assembly, and the new Studio JS (no frontend test harness exists). Since the transfer-leadership bug slipped through precisely because the JS is untested, a manual smoke test of the three new Studio buttons before merge is recommended.

Nice work overall - once the Studio select capture is fixed, this is in good shape.

…become-leader

Addresses the eighth review round:

1. Bug fix (functional): the "Transfer leadership" peer picker never sent the
   selected peer. globalPrompt only captures input/textarea values, not <select>,
   so values["transferPeerId"] was always undefined and the transfer silently did
   nothing. Read the select directly with $("#transferPeerId").val(), matching the
   convention used elsewhere in Studio.
2. Minor: become-leader cleanup now also clears FAILED acquisition statuses (not
   just LEADER_MISSING), so the failed-acquire-databases alert does not linger on a
   node that becomes leader, where the follower reconcile state is meaningless.
   ACQUIRED history is kept.

The legacy refresh path now skipping reserved databases (flagged in review) is
intentional: the leader never serves reserved DBs as snapshots, so skipping them is
correct on both the auto-acquire and the autoAcquire=false fallback paths.
@codacy-production

codacy-production Bot commented Jun 25, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics -136 complexity

Metric Results
Complexity -136

View in Codacy

🟢 Coverage 30.99% diff coverage · -7.43% coverage variation

Metric Results
Coverage variation -7.43% coverage variation
Diff coverage 30.99% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (dc19716) 131833 98036 74.36%
Head commit (b87b7ae) 164198 (+32365) 109902 (+11866) 66.93% (-7.43%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#4739) 355 110 30.99%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

PR Review: HA auto-acquire of databases a joining node has never seen (#4727)

Reviewed the full diff. This is a well-structured, carefully-reasoned change. The decomposition into pure, package-private, unit-testable pieces (classifyReconcile, executeReconcilePlan, applyReconcileOutcome, chooseEndpoint, the alert builders) is excellent and makes the genuinely-hard-to-test Raft path verifiable without a live cluster. The crash-safety design (stage under reserved .acquire-<name>, validate in staging, single atomic rename, recover on boot) is sound, and the additive-only guard with LEADER_MISSING flagging instead of dropping is the right call. Verified all four Studio-invoked endpoints (/cluster/leader, /stepdown, /verify/, /resync/) and the bootstrap-state response shape exist and match. Below are observations, ordered by importance.

Correctness / design

  1. Reporting InstallSnapshot success while a leader-held DB is still absent (give-up path). After ACQUIRE_GIVE_UP_AFTER (3) consecutive failures, applyReconcileOutcome returns false, so reconcileDatabasesFromLeader does not throw and notifyInstallSnapshotFromLeader returns success to Ratis. Ratis then considers this follower caught up at that index even though a database the leader holds was never installed (left FAILED). If that follower is later elected leader, you reproduce the exact How to replicate databases when new cluster nodes are added in Kubernetes ? #4522 condition (a leader missing a DB). The tradeoff vs. hot-looping the whole InstallSnapshot is reasonable and the failed-acquire-databases alert provides detection, but this is the one place where the "additive, never regress" invariant is intentionally relaxed. Worth confirming the alert is loud enough operationally, and consider whether a give-up DB should also block this node from accepting leadership (it currently does not).

  2. Behavior change on the refresh (existing-DB) path. Previously, a refresh/install failure of an already-present DB propagated and failed the whole install, so Ratis retried indefinitely. Now refresh failures are isolated and subject to the same 3-strike give-up. This is generally more robust (no starvation), but it is a semantic change for the legacy refresh path: a persistently failing refresh of an existing replica now stops forcing retry after 3 attempts. Please confirm this is intended for the non-auto-acquire (HA_AUTO_ACQUIRE_DATABASES=false) case too - the gating only changes acquire vs. refresh-only, not the give-up behavior, so even legacy users get the new give-up semantics.

  3. ACQUIRED status is cleared on the next reconcile. Once an acquired DB becomes local, the next InstallSnapshot puts it in toRefresh, and applyReconcileOutcome does acquireStatuses.remove(dbName) for refreshed DBs. So the ACQUIRED state surfaced in GET /api/v1/cluster is short-lived and disappears at the next snapshot install. That's fine functionally, but the PR text ("ACQUIRED is harmless history and is kept") slightly oversells its persistence. Not a bug.

Performance

  1. Presence fan-out is sequential on an Undertow worker thread. buildPresenceMatrix loops peers with a 5s per-peer timeout, so worst case is peers x 5s blocking a worker. You've documented this and it's opt-in + leader-only, so acceptable for now, but for larger clusters a bounded parallel fan-out (honoring the CLAUDE.md no-common-pool rule) would be worth a follow-up. Also note each queried peer may open a closed DB to fingerprint it - good that this is gated behind ?presence=true.

  2. HTTPS client built per call in LeaderDatabaseQuery.fetch. Acceptable given the infrequent paths and the try-with-resources cleanup (Java 21 HttpClient is AutoCloseable), and you've already flagged caching if it ever hits a hot path. No action needed.

Test coverage

  • Strong unit coverage on the pure logic (reconcile classification, execution isolation incl. unchecked exceptions, give-up/reset transitions, endpoint scheme selection, alert builders, staging recovery incl. the combined recovery pass). The end-to-end SnapshotAcquireNewDatabaseIT covers both the happy path and the concurrent-create refresh fallback, and asserts no leftover staging/marker files plus post-acquire replication continuity - exactly the right things.
  • Gaps (acceptable given they are network/Raft-bound, but worth noting): reconcileDatabasesFromLeader orchestration itself is untested - specifically the autoAcquire=false fallback, the leader-unreachable degrade-to-refresh branch, the lastTxId < 0 skip, and the status/counter pruning. Since those branches are pure-ish, extracting the leader-DB-set filtering into a small package-private helper would let you unit-test the lastTxId < 0 skip and pruning the way you did for the rest. buildPresenceMatrix is also untested.
  • Tag check: the IT extends BaseRaftHATest (spins 3 servers) - confirm it's excluded from the fast CI path per CLAUDE.md conventions (IT suffix / surefire-failsafe split). SnapshotAcquireNewDatabaseIT looks correctly named.

Minor / nits

  • ArcadeStateMachine now carries a fair amount of reconcile orchestration in addition to the Ratis state-machine contract. If it keeps growing, consider extracting a DatabaseReconciler collaborator. Not blocking.
  • Studio: escapeHtml is used consistently for the matrix/verify tables (good, no XSS via DB/peer names). Two globalNotify(...) calls interpolate peerId/databaseName unescaped into the message string; values originate from the trusted server peer/DB list so risk is low, but escaping there too would be consistent.
  • Config description (HA_AUTO_ACQUIRE_DATABASES) is thorough and accurate. Good that it documents the per-node, read-live, additive semantics and the retry-amplification caveat.

Verified

  • All Studio-referenced endpoints are registered in RaftHAPlugin.
  • bootstrap-state response (databases[].name, lastTxId, -1 for unreadable) matches LeaderDatabaseQuery.parse and the lastTxId < 0 skip.
  • new ArcadeStateMachine() (no-arg) matches the real construction in RaftHAServer, so the unit tests exercise the real object.
  • JSON access uses the default-value getter (getLong("lastTxId", -1L)) per project convention.

Overall: solid, defensive, well-tested work that directly addresses the root cause. My main ask is an explicit decision on item #1 (success-to-Ratis while a DB is still missing) and confirmation on #2 (give-up now applies to the legacy refresh path).

Reviewed by Claude (claude-opus-4-8). Compilation was not run in this review environment; relying on the PR's reported 355/0 ha-raft suite result.

@lvca lvca merged commit 2d111ea into main Jun 26, 2026
25 of 33 checks passed
lvca added a commit that referenced this pull request Jun 26, 2026
…StateMachine

Pure refactor (no behavioral change) following #4727/#4739: moves the database
reconciliation orchestration out of ArcadeStateMachine into a dedicated
DatabaseReconciler collaborator, separating it from the core Ratis state-machine
contract.

Moved into DatabaseReconciler:
- reconcileDatabasesFromLeader() and refreshExistingDatabases()
- classifyReconcile(), executeReconcilePlan(), applyReconcileOutcome(),
  bumpFailureAndShouldRetry()
- acquireStatuses / acquireFailureCounts state and ACQUIRE_GIVE_UP_AFTER
- AcquireStatus, AcquireState, ReconcilePlan, ReconcileOutcome, DbInstallOp types
- the become-leader follower-state cleanup (clearFollowerReconcileStatesOnBecomeLeader)

ArcadeStateMachine now holds a single reconciler instance, delegates from
notifyInstallSnapshotFromLeader/notifyLeaderChanged/setServer, and exposes it via
getReconciler(). GetClusterHandler and ClusterAlerts read statuses through the
reconciler. ReconcilePlanTest migrated to DatabaseReconcilerTest.

(cherry picked from commit c6f25dee589003c2a11cb8011a032f7e2a4a9116)
@lvca lvca deleted the worktree-issue-4727-ha-auto-acquire branch July 3, 2026 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HA: new cluster node does not auto-acquire databases it has never seen (only refreshes existing ones)

1 participant