feat: [#4727] HA auto-acquire of databases a joining node has never seen#4739
Conversation
A node joining a Raft HA cluster with an empty data directory now reconciles its local database set against the leader's and pulls (full snapshot install) any database it is missing, instead of only refreshing databases already on disk. Fixes the #4522 StatefulSet scale-up case where new pods only had the defaultDatabases entry. - New arcadedb.ha.autoAcquireDatabases (default true) gates the join-time pull; create-time push via INSTALL_DATABASE_ENTRY stays mandatory (core consensus). - notifyInstallSnapshotFromLeader enumerates the leader's databases (reusing the bootstrap-state RPC) and installs the union; additive-only, never drops a database the leader lacks (flagged LEADER_MISSING). - SnapshotInstaller.acquireNewDatabase: crash-safe never-seen acquisition staged under reserved databases/.acquire-<name>/ and published with an atomic rename, so a crash can never leave a half-written database the startup scan would open. - Observability: per-db acquireStatus in /api/v1/cluster, leader-missing alert, optional ?presence=true per-node/per-db presence fan-out. - Studio Cluster tab: transfer leadership, step down, verify consistency, and a per-database x per-node presence matrix. Tests: SnapshotAcquireStagingRecoveryTest (crash recovery of the staging dir), SnapshotAcquireNewDatabaseIT (end-to-end acquire from a live leader).
|
Tick the box to add this pull request to the merge queue (same as
|
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | -136 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
This pull request implements crash-safe auto-acquisition of unseen databases for joining cluster nodes (issue #4727). It introduces a reserved staging directory ('.acquire-') with atomic renames to prevent corrupt databases from being loaded on startup, exposes a database presence matrix and leadership controls in the Studio UI, and adds comprehensive integration and recovery tests. The review feedback highlights a critical bug in SnapshotInstaller where a failed database open prevents proper handle closure and cleanup on Windows due to file locking. Additionally, the feedback identifies two instances in ArcadeStateMachine and GetClusterHandler where catching general exceptions swallows InterruptedException without restoring the thread's interrupted status.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| try { | ||
| if (server.existsDatabase(databaseName)) | ||
| ((DatabaseInternal) server.getDatabase(databaseName)).getEmbedded().close(); | ||
| } catch (final Exception ignore) { | ||
| // best-effort close before removal | ||
| } |
There was a problem hiding this comment.
In the catch (final RuntimeException openEx) block, calling server.getDatabase(databaseName) will attempt to load/open the database again. Since the database just failed to open, this call will re-throw the exact same RuntimeException, which is caught and ignored by the inner catch (final Exception ignore). As a result, db.getEmbedded().close() is never executed.
On Windows, failing to close the database handles will prevent deleteDirectoryIfExists(dbPath) from succeeding due to file locking. This leaves a corrupt/partial database directory under the non-reserved name databases/<databaseName>/. On the next startup, ArcadeDBServer.loadDatabases will attempt to open this corrupt directory and crash the server, defeating the crash-safety design of acquireNewDatabase.
Using server.getDatabase(databaseName, false, false) avoids re-triggering the load/open exception and allows retrieving the registered instance safely to close it.
| try { | |
| if (server.existsDatabase(databaseName)) | |
| ((DatabaseInternal) server.getDatabase(databaseName)).getEmbedded().close(); | |
| } catch (final Exception ignore) { | |
| // best-effort close before removal | |
| } | |
| try { | |
| if (server.existsDatabase(databaseName)) { | |
| final DatabaseInternal db = (DatabaseInternal) server.getDatabase(databaseName, false, false); | |
| if (db != null) | |
| db.getEmbedded().close(); | |
| } | |
| } catch (final Exception ignore) { | |
| // best-effort close before removal | |
| } |
| } catch (final Exception e) { | ||
| LogManager.instance().log(this, Level.WARNING, | ||
| "Could not list the leader's databases for auto-acquire (%s); refreshing only the databases already " | ||
| + "present locally. Missing databases will be retried on the next reconcile.", e.getMessage()); | ||
| refreshExistingDatabases(leaderHttpAddr, leaderHttpsAddr, clusterToken); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Swallowing InterruptedException without restoring the interrupted status of the thread is a bad practice in Java concurrency. When InterruptedException is caught, the thread's interrupted status is cleared. If it is not restored by calling Thread.currentThread().interrupt(), upstream callers or the thread pool executor may not realize the thread was interrupted, potentially preventing a clean shutdown.
| } catch (final Exception e) { | |
| LogManager.instance().log(this, Level.WARNING, | |
| "Could not list the leader's databases for auto-acquire (%s); refreshing only the databases already " | |
| + "present locally. Missing databases will be retried on the next reconcile.", e.getMessage()); | |
| refreshExistingDatabases(leaderHttpAddr, leaderHttpsAddr, clusterToken); | |
| return; | |
| } | |
| } catch (final InterruptedException e) { | |
| Thread.currentThread().interrupt(); | |
| LogManager.instance().log(this, Level.WARNING, | |
| "Database reconciliation interrupted; refreshing only the databases already present locally."); | |
| refreshExistingDatabases(leaderHttpAddr, leaderHttpsAddr, clusterToken); | |
| return; | |
| } catch (final Exception e) { | |
| LogManager.instance().log(this, Level.WARNING, | |
| "Could not list the leader's databases for auto-acquire (%s); refreshing only the databases already " | |
| + "present locally. Missing databases will be retried on the next reconcile.", e.getMessage()); | |
| refreshExistingDatabases(leaderHttpAddr, leaderHttpsAddr, clusterToken); | |
| return; | |
| } |
| } catch (final Exception e) { | ||
| LogManager.instance().log(this, Level.WARNING, | ||
| "Presence matrix: could not query peer '%s' (%s)", peerIdStr, e.getMessage()); | ||
| unreachable.add(peerIdStr); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Similar to the state machine reconciliation, catching Exception here swallows InterruptedException without restoring the thread's interrupted status. This can prevent proper thread pool management and clean shutdowns.
} catch (final InterruptedException e) {
Thread.currentThread().interrupt();
LogManager.instance().log(this, Level.WARNING,
"Presence matrix query interrupted for peer '%s'", peerIdStr);
unreachable.add(peerIdStr);
continue;
} catch (final Exception e) {
LogManager.instance().log(this, Level.WARNING,
"Presence matrix: could not query peer '%s' (%s)", peerIdStr, e.getMessage());
unreachable.add(peerIdStr);
continue;
}
PR Review: HA auto-acquire of databases a joining node has never seen (#4727)Reviewed against Correctness / potential bugs
Minor / style
Performance
Security
Test coverage
Nice work overall - the documentation on the staging/rename invariant is exemplary and made review straightforward. Automated review; please verify suggestions against intent. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4739 +/- ##
============================================
- Coverage 65.31% 65.23% -0.09%
- Complexity 567 622 +55
============================================
Files 1673 1674 +1
Lines 131833 132395 +562
Branches 28232 28320 +88
============================================
+ Hits 86103 86362 +259
- Misses 33833 34124 +291
- Partials 11897 11909 +12 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
- SnapshotInstaller.acquireNewDatabase now VALIDATES the downloaded snapshot while it is still under the reserved .acquire-<name> staging dir (open read-only + close) BEFORE the atomic publish. A corrupt snapshot is discarded in staging and never reaches databases/<name>/, closing the window where a failed post-publish open could leave an undeletable corrupt directory (Windows file-lock) that the startup scan would then try to open. Replaces the previous cleanup that re-opened the DB to close it (which re-threw and never closed). - Restore the interrupt flag on InterruptedException in reconcileDatabasesFromLeader and GetClusterHandler.buildPresenceMatrix instead of swallowing it via catch(Exception). - Prune stale acquireStatuses entries (databases no longer local nor on the leader) at reconcile start so the leader-missing alert clears once a database is dropped/redistributed. - Reuse RaftHAServer.FORWARDED_ROOT_USER (now package-private) in LeaderDatabaseQuery instead of the inlined "root" literal. Tests: ClusterAlertsTest covers the leader-missing alert builder (raised with names / absent when empty); SnapshotAcquireNewDatabaseIT adds the concurrent-create fallback (acquire on an existing DB refreshes in place).
|
Review: PR #4739 — HA auto-acquire of databases a joining node has never seen Thorough, well-documented change. The crash-safety design (reserved A few items, in priority order. 1.
On an SSL-enabled cluster where the plain-HTTP listener is disabled, 2.
3.
4. Minor
Tests Good coverage of the new code: staging recovery (interrupted acquire, coexisting pending-swap, completed-DB-untouched), the end-to-end acquire IT, and the concurrent-create refresh-fallback branch. The paths not covered are the SSL behavior of item 1 and the leader-unreachable degradation branch — a unit test asserting the fallback fires when Overall a solid fix for #4727. Item 1 is the only one I would consider blocking, since it determines whether the feature actually works on the SSL deployments most likely to be running a StatefulSet. Reviewed against CLAUDE.md conventions (HA concurrency rules, JSON helpers, test style). |
…s semantics Addresses code-review findings: 1. (blocking) LeaderDatabaseQuery now honors arcadedb.ssl.enabled instead of hardcoding http://: it prefers the peer's HTTPS endpoint and trusts the cluster keystore via SnapshotInstaller.buildSSLContext, falling back to plain HTTP with a one-time warning only when no HTTPS address is known - mirroring SnapshotInstaller.downloadWithRetry (#4470). Without this, auto-acquire and the presence matrix were silently disabled on SSL-only clusters (the plain-HTTP listener is typically off there), defeating the feature for the StatefulSet/empty-node case it targets. Both callers updated to pass the peer HTTPS address + server (reconcileDatabasesFromLeader, buildPresenceMatrix). 2. acquireNewDatabase now takes the same INSTALLS_IN_FLIGHT overlap guard install() uses, releasing it before delegating to install() on the concurrent-create fallback so no spurious overlap is logged. 3. acquireStatus (ACQUIRING/ACQUIRED/FAILED) is recorded only for genuinely-new acquisitions; a routine snapshot refresh of an already-present database is no longer reported as ACQUIRED. 4. Doc/wording: HA_AUTO_ACQUIRE_DATABASES is read live per reconcile (not per boot); note the presence fan-out's sequential cost and remote-open tension. Tests: LeaderDatabaseQueryTest covers scheme selection (http when SSL off, https when SSL on + https addr known, http fallback when SSL on but no https addr, null when no usable address).
Review: HA auto-acquire of unseen databases (#4727)Reviewed the full diff. This is a well-structured, carefully-reasoned change. The crash-safety design in A few things worth addressing. 1.
|
… plan
Addresses the second review round:
1. Sticky LEADER_MISSING status/alert. Once flagged, a LEADER_MISSING database's
status never cleared short of a restart, so the leader-missing cluster alert
lingered through the very resync / leadership-transfer recovery flow this PR
documents. Now:
- the refresh branch clears the acquisition status (acquireStatuses.remove)
when the leader regains a database this node already has, and
- becoming LEADER clears any LEADER_MISSING entries (meaningless on the leader,
which never runs the follower reconcile path).
2. Extracted the set-reconciliation into a pure, package-private
ArcadeStateMachine.classifyReconcile(leaderDbs, localDbs) -> ReconcilePlan
{toAcquire, toRefresh, leaderMissing} and drove reconcileDatabasesFromLeader
from it, so the union/difference + LEADER_MISSING logic is unit-testable
(ReconcilePlanTest) - the gap that let the sticky-status bug through.
3. Minor: validateSnapshotOpens now closes the DatabaseFactory too (try-with-
resources on both factory and database).
Tests: ReconcilePlanTest covers empty-node/full-replica/mixed/orphan splits and
sorted-output determinism.
Code Review: HA auto-acquire of never-seen databases (#4727)Reviewed the full diff plus the surrounding HA code. This is a well-engineered fix: the crash-safe staging/atomic-rename approach is the right pattern, the set-reconciliation logic is cleanly factored into a pure, unit-tested method, and the SSL-aware transport, additive-only guard, and graceful degradation on an unreachable leader all show careful thought. The Javadoc is genuinely useful. Tests cover the pure logic, crash recovery, and an end-to-end acquire. Nice work overall - the comments below are mostly refinements, with one robustness point worth discussing. Potential issue: one failing acquire starves refresh of healthy databasesIn `reconcileDatabasesFromLeader`, the `plan.toAcquire()` loop runs before `plan.toRefresh()`, and `acquireNewDatabase` rethrows `IOException` (e.g. from `validateSnapshotOpens`), which aborts the whole method: ```java Because Ratis re-triggers the same `InstallSnapshot`, a leader database whose snapshot persistently fails validation (corrupt/incompatible) would abort at the same point every pass, so the already-replicated databases on this follower never get refreshed - a single bad new DB indefinitely starves the healthy ones. Consider collecting per-database failures and continuing (still failing the overall install at the end so Ratis retries), or at least running the refresh loop before propagating an acquire failure. The `FAILED` status you already record would then reflect a partial result rather than an all-or-nothing abort. Minor: error-state leader databases are treated as present`PostBootstrapStateHandler` includes databases it could not open with `lastTxId=-1` / empty fingerprint, and `LeaderDatabaseQuery.parse` adds every entry to the set unconditionally. So a follower may try to `acquireNewDatabase` a database the leader itself cannot open. The download would likely fail validation and surface as `FAILED` (and feed the starvation above). Filtering out entries the peer reported an `error` / `lastTxId == -1` for might be cleaner, since such an entry is not a usable source. Minor: validateSnapshotOpens opens under the staging nameThe snapshot is opened read-only while still named `.acquire-`, then atomically renamed to ``. This relies on the database name being derived from the directory at open time (so no `.acquire-` name is persisted) and on the read-only open leaving no lock/WAL artifacts behind in staging that would then be published into the final directory. Both appear to hold for ArcadeDB, but it's worth a one-line confirmation since a stale artifact carried across the rename would be hard to diagnose. Minor: unused supplier flexibility`acquireNewDatabase` takes `Supplier` for the leader addresses, but every caller passes a captured constant (`() -> leaderHttpAddr`). If the intent was to re-resolve the current leader across retries, the constant lambda defeats it; if not, plain `String` params (matching `install`) would be simpler. Not a bug - just an inconsistency. Performance / scope notes (no action required)
StudioXSS handling looks solid - `escapeHtml` is applied to all interpolated peer/database names, and `verifyDatabase`'s onclick arg is JSON-encoded and quote-escaped. The graceful degradation when cards are absent (older `cluster.html`) is a nice touch. TestsGood coverage of the pure reconcile logic, endpoint/scheme selection, crash recovery (including the combined acquire-staging + pending-swap case), and an end-to-end acquire with the concurrent-create fallback branch. Two gaps worth considering: (1) a test asserting reconcile behavior when one acquire fails but other databases should still be refreshed (the starvation point above), and (2) the `LEADER_MISSING` -> alert path is unit-tested at the `ClusterAlerts` level but not the `ArcadeStateMachine` transition that populates it. ConventionsMatches the repo style well (final params, no fully-qualified names, `assertThat(...)` AssertJ syntax, Apache headers, no Claude attribution in sources). The `SnapshotAcquireNewDatabaseIT` is correctly `IT`-suffixed so it stays out of the normal `mvn test` run. Note: I could not run `mvn` in this sandbox, so I reviewed statically and relied on the stated "355 tests, 0 failures" for the suite result. 🤖 Generated with Claude Code |
…ble leader DBs Addresses the third review round: 1. A failing acquire no longer starves refresh of healthy databases. The reconcile execution is extracted into a pure, operation-injected ArcadeStateMachine.executeReconcilePlan(plan, acquireOp, refreshOp) that runs every database with per-database failure isolation: a persistently corrupt new DB (fails validation each pass) no longer aborts before the already-replicated databases are refreshed. Per-DB failures are collected, statuses applied from the outcome, and the overall install still fails at the end so Ratis re-triggers (installs are idempotent). 2. Filter leader databases the leader reported as unreadable (lastTxId < 0): such a database is not a usable snapshot source, so it is skipped (and logged) rather than repeatedly attempted and failed. 3. Minor: documented that the read-only validate-in-staging open derives the name from the directory and leaves no artifact carried across the atomic rename (covered by SnapshotAcquireNewDatabaseIT); noted the address suppliers are intentionally constant for a fixed InstallSnapshot leader. Tests: ReconcilePlanTest now also asserts a failing acquire still lets healthy refreshes run (no starvation) and the all-succeed path.
Review: HA auto-acquire of never-seen databases (#4727)Solid, well-engineered PR. The crash-safety design (stage under a reserved A few things worth a look before merge: Potential issues1. A persistently-failing acquire hot-loops the whole reconcile (medium). 2. 3. No assertion that an acquired DB keeps receiving replication (low, test gap). Minor / nits
SecurityNo new concerns. TestsGood coverage of the pure logic ( Overall: in good shape - issue #1 is the one I'd most want addressed (or explicitly accepted) before merge. Reviewed with Claude Code. |
…fast guard Addresses the fourth review round: 1. A persistently-failing acquire no longer hot-loops the whole reconcile. A database that opens on the leader but fails validateSnapshotOpens here would, under the always-throw policy, make Ratis re-trigger InstallSnapshot every pass and re-download every healthy database on this follower. Now consecutive acquire/refresh failures are counted per database; after ACQUIRE_GIVE_UP_AFTER (3) in a row that database stops forcing the retry (left FAILED, surfaced in cluster status, retried on the next natural InstallSnapshot). The counter resets on any success and is pruned with the status map. 2. validateSnapshotOpens now opens with the server's security + auto-transaction so the read-only validation matches the settings of the eventual live open in ArcadeDBServer.getDatabase rather than diverging under defaults. 3. acquireNewDatabase now fails fast (throws) if an acquisition for the same database is already in flight, instead of proceeding and risking two acquisitions racing on the staging dir / atomic rename. Tests: SnapshotAcquireNewDatabaseIT now also asserts a write made on the leader after the acquisition replicates to the newly-registered follower copy (ongoing replication, not just the one-time pull).
Review of #4739 - HA auto-acquire of never-seen databasesReviewed the full diff against the repo conventions. This is a well-engineered fix with a clear root-cause analysis, strong crash-safety design, and good separation of pure/testable logic. Summary below. Strengths
Concerns / suggestions
SecurityNo issues spotted. Peer RPCs reuse the cluster token + forwarded root user; the new Studio endpoints ( Style/conventionsConsistent with the codebase: Nice work overall - the crash-safety and additive-only design are the right instincts for an HA data path. The main thing I would push on before merge is closing the test gap on the reconcile bookkeeping (point 1).
|
…+ fallback parity Addresses the fifth review round: 1. Extracted the post-executeReconcilePlan bookkeeping (status-map transitions, failure-counter updates, and the give-up/retry-worthwhile decision) into a package-private ArcadeStateMachine.applyReconcileOutcome(plan, outcome). It has no leader/network I/O, so the ACQUIRED/FAILED/LEADER_MISSING transitions, the refresh-clears-stale-LEADER_MISSING behavior, and the give-up-after-N decision are now directly unit-tested (ReconcilePlanTest) - closing the gap where this orchestration glue was only exercised via injected ops. 2. Presence fan-out now uses a short 5s per-peer timeout instead of HA_BOOTSTRAP_TIMEOUT_MS (120s), so a peer that accepts the connection but hangs cannot tie up an Undertow worker for minutes x N peers. 3. refreshExistingDatabases (the legacy/fallback path) now skips reserved databases (RESERVED_DATABASE_PREFIX) for symmetry with the auto-acquire path. 4. Documented that a long-term-FAILED database is still attempted on the next natural InstallSnapshot (which re-downloads everything anyway) so it can recover if the leader's copy is fixed; it just no longer hot-loops. Tests: ReconcilePlanTest adds direct coverage of applyReconcileOutcome - status transitions, stale-LEADER_MISSING clearing on refresh, and the give-up threshold (retry forced until N consecutive failures, then suppressed; reset on success).
Code Review - PR #4739: HA auto-acquire of databases a joining node has never seenReviewed the full diff against What's done well
Suggestions / minor issues
VerdictSolid, defensive, and well-tested. Item (1) is the only one I'd suggest addressing before merge (small change, completes the failure-isolation guarantee); the rest are minor/optional. Automated review by Claude. Treat as advisory. |
Addresses the sixth review round: 1. executeReconcilePlan now isolates Exception (not just IOException) per database. An unchecked failure escaping one database's acquire/refresh (e.g. an NPE) would otherwise propagate and abort the whole reconcile, defeating the no-starvation guarantee. Each is now recorded as a failure and the others proceed. 2. Cleaner retry IOException message (acquire=.. , refresh=..) instead of two concatenated maps. 3. Documented the boundary that auto-acquire runs only on the follower InstallSnapshot path (Raft's up-to-date-log election rule + the LEADER_MISSING alert / leadership-transfer flow keep it unreachable in normal operation). 4. Studio: a hint under "Load presence matrix" that it queries each node and a node may briefly open a closed database to report it - run on demand. 5. Noted HttpClient-per-HTTPS-call is fine for these infrequent paths; cache an SSL client if it ever moves onto a hot path. Tests: ReconcilePlanTest adds a runtime-exception isolation case (a RuntimeException from one acquire still lets the healthy databases refresh).
Review: HA auto-acquire of never-seen databases (#4727)Thanks for this - it's a well-structured fix for a real gap (empty StatefulSet nodes silently missing runtime-created databases). The crash-safety design and the deliberate extraction of pure, unit-testable logic stand out. Detailed feedback below. Strengths
Concerns / suggestions
TestsCoverage is strong for the new deterministic surface: reconcile set logic, no-starvation execution, status/give-up bookkeeping, endpoint scheme selection, staging crash recovery (incl. the combined acquire-staging + pending-swap pass), and an end-to-end Overall a careful, defensively-coded change. The main things I'd push on before merge are items #1/#2 (retry amplification + the silent FAILED gap), since they affect behavior under real failure rather than the happy path. Reviewed against CLAUDE.md conventions (final usage, brace style, no new deps, concurrency rules). |
…riant notes Addresses the seventh review round: 1. New cluster alert (failed-acquire-databases) for databases left FAILED after the acquire give-up. Because a given-up database is only retried on the next natural InstallSnapshot (which Ratis avoids via log replay), it could stay absent even after the leader's copy is fixed; the alert surfaces it so an operator can force a resync. Distinct from the LEADER_MISSING alert. 2. Documented the whole-install retry amplification in HA_AUTO_ACQUIRE_DATABASES: a persistently-failing database is retried a bounded number of times, and each retry re-downloads the other databases too (capped, so it cannot loop forever). 3. Added a one-line invariant comment at the publish delete+rename noting it rests on Ratis applying the log single-threaded relative to the snapshot-install path. 4. Clear acquireFailureCounts on become-leader (parallel to the LEADER_MISSING status cleanup); strengthened the validate-in-staging name-derivation note. 5. Presence-matrix Javadoc now records the CLAUDE.md concurrency rule for any future parallel fan-out (no common ForkJoinPool; dedicated bounded pool + PoolMetrics). Confirmed globalNotify renders via textContent (not HTML), so the Studio message strings with database names are injection-safe; verifyDatabase escapes its inputs. Tests: ClusterAlertsTest adds failed-acquire alert coverage (raised with names / absent when empty).
|
PR Review: HA auto-acquire of never-seen databases (issue 4727) Thorough, well-documented change. The crash-safety design is the strongest part, and the decomposition into pure, unit-testable units is exactly right for this kind of distributed-systems logic. Findings ordered by severity. Bug (functional): "Transfer leadership" peer picker never sends the selected peer In globalPrompt("Transfer Leadership", html, "Transfer", function(values) {
var peerId = values["transferPeerId"];
if (!peerId) return; // always undefined for a <select>, so this returns earlyBut Fix: read the select directly, consistent with the rest of Studio: Minor: stale FAILED status survives on a node that becomes leader
Behavior change worth calling out: legacy refresh path now skips reserved DBs The old loop in Things I verified that look correct
Performance / security
Test coverage Strong on the pure logic ( Nice work overall - once the Studio select capture is fixed, this is in good shape. |
…become-leader
Addresses the eighth review round:
1. Bug fix (functional): the "Transfer leadership" peer picker never sent the
selected peer. globalPrompt only captures input/textarea values, not <select>,
so values["transferPeerId"] was always undefined and the transfer silently did
nothing. Read the select directly with $("#transferPeerId").val(), matching the
convention used elsewhere in Studio.
2. Minor: become-leader cleanup now also clears FAILED acquisition statuses (not
just LEADER_MISSING), so the failed-acquire-databases alert does not linger on a
node that becomes leader, where the follower reconcile state is meaningless.
ACQUIRED history is kept.
The legacy refresh path now skipping reserved databases (flagged in review) is
intentional: the leader never serves reserved DBs as snapshots, so skipping them is
correct on both the auto-acquire and the autoAcquire=false fallback paths.
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | -136 |
🟢 Coverage 30.99% diff coverage · -7.43% coverage variation
Metric Results Coverage variation ✅ -7.43% coverage variation Diff coverage ✅ 30.99% diff coverage Coverage variation details
Coverable lines Covered lines Coverage Common ancestor commit (dc19716) 131833 98036 74.36% Head commit (b87b7ae) 164198 (+32365) 109902 (+11866) 66.93% (-7.43%) Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch:
<coverage of head commit> - <coverage of common ancestor commit>Diff coverage details
Coverable lines Covered lines Diff coverage Pull request (#4739) 355 110 30.99% Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified:
<covered lines added or modified>/<coverable lines added or modified> * 100%
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
PR Review: HA auto-acquire of databases a joining node has never seen (#4727)Reviewed the full diff. This is a well-structured, carefully-reasoned change. The decomposition into pure, package-private, unit-testable pieces ( Correctness / design
Performance
Test coverage
Minor / nits
Verified
Overall: solid, defensive, well-tested work that directly addresses the root cause. My main ask is an explicit decision on item #1 (success-to-Ratis while a DB is still missing) and confirmation on #2 (give-up now applies to the legacy refresh path). Reviewed by Claude (claude-opus-4-8). Compilation was not run in this review environment; relying on the PR's reported 355/0 ha-raft suite result. |
…StateMachine Pure refactor (no behavioral change) following #4727/#4739: moves the database reconciliation orchestration out of ArcadeStateMachine into a dedicated DatabaseReconciler collaborator, separating it from the core Ratis state-machine contract. Moved into DatabaseReconciler: - reconcileDatabasesFromLeader() and refreshExistingDatabases() - classifyReconcile(), executeReconcilePlan(), applyReconcileOutcome(), bumpFailureAndShouldRetry() - acquireStatuses / acquireFailureCounts state and ACQUIRE_GIVE_UP_AFTER - AcquireStatus, AcquireState, ReconcilePlan, ReconcileOutcome, DbInstallOp types - the become-leader follower-state cleanup (clearFollowerReconcileStatesOnBecomeLeader) ArcadeStateMachine now holds a single reconciler instance, delegates from notifyInstallSnapshotFromLeader/notifyLeaderChanged/setServer, and exposes it via getReconciler(). GetClusterHandler and ClusterAlerts read statuses through the reconciler. ReconcilePlanTest migrated to DatabaseReconcilerTest. (cherry picked from commit c6f25dee589003c2a11cb8011a032f7e2a4a9116)
Summary
Fixes #4727. A node joining a Raft HA cluster with an empty data directory never acquired databases it had not previously seen on disk - it only refreshed databases already present, so runtime-created/imported databases were silently missing on new nodes (the #4522 StatefulSet scale-up case: new pods came up with only the
defaultDatabasesentry).Now a joining node reconciles its local database set against the leader's and pulls (full snapshot install) any database it is missing.
Root cause
ArcadeStateMachine.notifyInstallSnapshotFromLeaderiteratedserver.getDatabaseNames()(local DBs only), so a brand-new node had nothing to install. There was no step that listed the leader's databases and pulled the missing ones.What changed
Core (engine / ha-raft)
arcadedb.ha.autoAcquireDatabases(default true), per-node, re-read each boot. It gates the join-time pull only; the create-time push viaINSTALL_DATABASE_ENTRYstays mandatory (core Raft consensus).notifyInstallSnapshotFromLeadernow enumerates the leader's databases (reusing the existingPOST /api/v1/cluster/bootstrap-stateRPC via the newLeaderDatabaseQuery) and installs the union of leader + local DBs - refreshing existing ones, acquiring never-seen ones.DROP_DATABASE_ENTRYdistinguishable from "the leader never had it"); it is flaggedLEADER_MISSING.SnapshotInstaller.acquireNewDatabase: crash-safe acquisition of a never-seen database. The download is staged under a reserveddatabases/.acquire-<name>/directory (skipped by the startup scan) and published with a single atomic rename, so a crash mid-download can never leave a half-writtendatabases/<name>/thatloadDatabaseswould try to open. Startup recovery cleans leftover.acquire-*staging dirs.Observability
acquireStatus(ACQUIRING/ACQUIRED/LEADER_MISSING/FAILED) inGET /api/v1/cluster.leader-missing-databasescluster alert (the How to replicate databases when new cluster nodes are added in Kubernetes ? #4522 aggravating factor: a node lacking a DB is elected leader).GET /api/v1/cluster?presence=trueper-node x per-database presence fan-out (gated so the auto-poll stays cheap).Studio Cluster tab
Tests
SnapshotAcquireStagingRecoveryTest(unit) - crash recovery of the reserved acquisition staging dir; verifies no phantom non-reserved database directory is left behind.SnapshotAcquireNewDatabaseIT- drivesacquireNewDatabaseend-to-end against a live leader (HTTP snapshot download → atomic publish → register), asserting the DB is acquired with all records and no leftover staging/marker files.ha-raftsuite: 355 tests, 0 failures.🤖 Generated with Claude Code