Skip to content

fix(#4458): eliminate TS compaction/append deadlock causing WAL version gap on Raft followers#4460

Merged
robfrank merged 7 commits into
mainfrom
fix/4458-ts-raft-wal-page0-version-gap
Jun 2, 2026
Merged

fix(#4458): eliminate TS compaction/append deadlock causing WAL version gap on Raft followers#4460
robfrank merged 7 commits into
mainfrom
fix/4458-ts-raft-wal-page0-version-gap

Conversation

@robfrank

@robfrank robfrank commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Closes #4458

Summary

On Raft HA leaders, appendSamples() held compactionLock.readLock() through db.commit(). db.commit() calls waitForActiveRecordingSession(), which spins until the compaction recording session ends. The session ends only after Phase 4c (which needs compactionLock.writeLock()), but Phase 4c cannot acquire the write lock while any thread holds the read lock — a deadlock broken only by HA_QUORUM_TIMEOUT. When that timeout fired, TX_ENTRY shipped before the buffered SCHEMA_ENTRY, causing WALVersionGapException on the follower and forcing a snapshot resync.

Fix: on replicated databases, release compactionLock.readLock() before db.commit() so Phase 4c can always acquire the write lock without deadlock. If Phase 4c commits a page-0 clear between the lock release and the commit, the append gets a ConcurrentModificationException and retries transparently on the freshly-cleared page (at most 3 retries). Phase 0 also gains a CME retry loop to handle in-flight append commits. Standalone databases keep the original locking (read lock through commit) since they have no recording-session deadlock.

Files changed:

  • engine/.../TimeSeriesShard.java: serialize via appendLock first; on replicated DBs release compactionLock.readLock() before commit with a CME retry loop; add a Phase 0 CME retry loop; add TEST_PRE_PHASE4C_HOOK for deterministic testing
  • ha-raft/.../ArcadeStateMachine.java: add TEST_WAL_GAP_COUNTER for deterministic testing

Test plan

  • Issue4458AppendCompactionRaceTest (engine, embedded) - regression guard for concurrent appends + compaction on the standalone path (Javadoc notes the HA-only CME-retry path is exercised by the HA test below)
  • Issue4458TsWalVersionGapIT (ha-raft) - 2-node Raft HA, uses TEST_PRE_PHASE4C_HOOK to race appends against Phase 4c deterministically, verifies TEST_WAL_GAP_COUNTER == 0 on the follower
  • Full TimeSeries* engine suite (173 tests) - no regressions
  • TimeSeriesConcurrentInsertTest (48-thread concurrent insert + auto-compaction) - passes
  • TimeSeriesGrpcHaConcurrentInsertIT (existing 3-thread gRPC HA smoke test) - passes

…on gap on Raft followers

On Raft HA leaders, appendSamples() held compactionLock.readLock() through db.commit().
db.commit() calls waitForActiveRecordingSession(), which waits for the compaction recording
session to end. The session ends only after Phase 4c (which needs the write lock), but
Phase 4c cannot acquire the write lock while the read lock is held — a deadlock broken only
by the HA_QUORUM_TIMEOUT. When that timeout fired, TX_ENTRY shipped before the buffered
SCHEMA_ENTRY, causing WALVersionGapException on the follower and forcing a snapshot resync.

Fix: on replicated databases, release compactionLock.readLock() before db.commit() so
Phase 4c can always acquire the write lock without deadlock. If Phase 4c commits a
page-0 clear between the lock release and commit, the append gets a
ConcurrentModificationException and retries transparently on the freshly-cleared page.
Phase 0 also gains a CME retry loop to handle in-flight append commits. Standalone
databases keep the original locking (readLock through commit) since they have no
recording-session deadlock.

Tests:
- Issue4458AppendCompactionRaceTest: embedded CME-retry correctness
- Issue4458TsWalVersionGapIT: 2-node HA deterministic race via TEST_PRE_PHASE4C_HOOK,
  verifies zero WAL version gaps on the follower
@codacy-production

codacy-production Bot commented Jun 2, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 17 complexity

Metric Results
Complexity 17

View in Codacy

🟢 Coverage 67.61% diff coverage · -7.53% coverage variation

Metric Results
Coverage variation -7.53% coverage variation
Diff coverage 67.61% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (8bd740d) 124323 92571 74.46%
Head commit (a31020d) 160094 (+35771) 107157 (+14586) 66.93% (-7.53%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#4460) 71 48 67.61%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses issue #4458 by resolving a deadlock between concurrent appends and compaction Phase 4c in Raft HA mode. The fix releases the compaction read lock before committing on Raft HA leaders and introduces transparent retries for ConcurrentModificationException when compaction races with appends. Additionally, tests and test-only hooks are added to verify the fix. The reviewer noted that the test-only hook TEST_PRE_REPLICATE_SCHEMA_HOOK is defined and cleaned up but never actually utilized in the tests, suggesting its removal if it is not required.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +147 to +152
/**
* Test-only hook. Fires in {@link #runWithCompactionReplication} just before {@code replicateSchema()}
* is called. Used to inject delays and verify that concurrent appends correctly wait for the
* recording session to end rather than timing out and shipping TX_ENTRY before the SCHEMA_ENTRY.
*/
static volatile Consumer<String> TEST_PRE_REPLICATE_SCHEMA_HOOK = null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The test-only hook TEST_PRE_REPLICATE_SCHEMA_HOOK is defined and cleaned up in the tests, but it is never actually set or utilized within Issue4458TsWalVersionGapIT or any other test in this pull request. If it is not required for deterministic testing, consider removing it to keep the codebase clean.

@claude

claude Bot commented Jun 2, 2026

Copy link
Copy Markdown

Code Review - Fix #4458: TS compaction/append deadlock causing WAL version gap on Raft followers

The root cause analysis is clear and the fix is well-reasoned. A few observations below.


Correctness

The deadlock analysis and fix are sound:

  • The original invariant ("hold read lock through commit so Phase 4c can never see an MVCC conflict") was broken in HA mode because commit() on a Raft leader calls waitForActiveRecordingSession(), which spins waiting for the compaction recording session - a session that can only end after Phase 4c, which needs the write lock. Classic ABBA deadlock.
  • Releasing the read lock before commit() on HA leaders breaks the cycle. The CME-retry path correctly handles the window where Phase 4c clears page-0 between the lock release and the append's commit.
  • Phase 0 CME-retry (up to 3 attempts, inside the write lock) handles in-flight HA appends that committed page-0 after Phase 0 began its transaction but before it commits. The "at most 1 in-flight commit outstanding at a time" invariant (due to appendLock) means one retry is almost always sufficient; 3 for safety is reasonable.

Import ordering

com.arcadedb.exception.ConcurrentModificationException is inserted between stdlib java.util imports. Conventional Java import order puts project imports before java.*. This may trip Prettier or a style check:

// Current (mixed)
import java.util.logging.Level;
import java.util.ArrayList;
import java.util.Arrays;
import com.arcadedb.exception.ConcurrentModificationException;  // <-- out of order
import java.util.HashSet;

Recursive retry vs. loop

appendSamplesAttempt calls itself recursively for retries. With a max depth of 3 this is harmless, but a simple loop would be easier to follow and avoids any stack growth concern:

private void appendSamplesAttempt(final long[] timestamps, final Object[]... columnValues)
    throws IOException {
  for (int attempt = 3; attempt > 0; attempt--) {
    compactionLock.readLock().lock();
    boolean readLockHeld = true;
    try {
      // ...same body...
    } catch (final ConcurrentModificationException e) {
      if (db.isTransactionActive()) db.rollback();
      if (attempt == 1)
        throw new IOException("Failed to append timeseries samples after compaction-race retries", e);
      // else continue loop
    } finally {
      if (readLockHeld) compactionLock.readLock().unlock();
    }
  }
}

Embedded test does not exercise the CME-retry path

Issue4458AppendCompactionRaceTest runs on an embedded (standalone) database where the read lock is held through commit() (the if (db.isReplicated()) branch is never taken). That means:

  • Phase 4c's write lock waits for all in-flight read-lock holders to release before it can run. There is no window where Phase 4c clears page-0 between a lock release and a commit - the lock is never released early.
  • The CME-retry path is never triggered.

The test is still a valid regression test for concurrent appends + compaction under the embedded engine, but the Javadoc says it "Verifies CME-retry preserves all data when Phase 4c races concurrent appends" - that claim only holds in HA mode. The HA integration test (Issue4458TsWalVersionGapIT) covers the actual CME-retry path. Worth updating the embedded test's Javadoc to clarify scope.


Minor: appendsBusy.countDown() inside the inner loop

for (int i = 0; i < appendsPerThread; i++) {
    appendsBusy.countDown(); // signal we're running (idempotent after first)
    ...
}

This calls countDown() 200 times per thread (1600 total for 8 threads) even though only the first call from each thread matters. Since CountDownLatch ignores calls below 0 it's functionally correct, but a single appendsBusy.countDown() before the loop (or an AtomicBoolean guard) would be cleaner and match the comment's intent.


Test cleanup

Both @AfterEach methods correctly null out the static test hooks - important since these are public static volatile fields that would otherwise bleed into subsequent tests.


Docs

The new docs/4458-ts-raft-wal-page0-version-gap.md is a clear write-up of the problem, fix, and verification plan. Good to have this alongside the code.


Overall: The fix is correct and well-tested for the HA path. The items above are mostly minor polish (import ordering, loop vs. recursion, embedded test Javadoc). No blocking issues found.

…rify embedded test scope

- Remove unused TEST_PRE_REPLICATE_SCHEMA_HOOK from RaftReplicatedDatabase (gemini)
- Move com.arcadedb.exception.ConcurrentModificationException import out of java.util block (claude)
- Convert recursive appendSamplesAttempt to an iterative retry loop (claude)
- Clarify Issue4458AppendCompactionRaceTest javadoc: standalone path does not exercise
  the HA-only CME-retry; HA path is covered by Issue4458TsWalVersionGapIT (claude)
- Hoist appendsBusy.countDown() out of the inner insert loop (claude)
@claude

claude Bot commented Jun 2, 2026

Copy link
Copy Markdown

Code Review - PR #4460: TS compaction/append deadlock fix

Overview

This PR fixes a deadlock triangle in Raft HA mode that caused WAL page-version gaps on followers, ultimately forcing costly snapshot resyncs. The root cause is well-diagnosed: appendSamples() held compactionLock.readLock() through db.commit(), which calls waitForActiveRecordingSession() - creating a three-way deadlock with Phase 4c. The fix (release the read lock before commit in HA mode, accept CME and retry) is sound.


Issues and Suggestions

Missing file from diff

The PR description lists RaftReplicatedDatabase.java as changed (adding TEST_PRE_REPLICATE_SCHEMA_HOOK), but this file does not appear in the diff. If the hook was removed or superseded, the description and docs file should be updated to avoid confusion.


Lock ordering change

The previous ordering was compactionLock.readLock -> appendLock. The new ordering is appendLock -> compactionLock.readLock (the read lock is now acquired inside the retry loop, under appendLock). Compaction holds only compactionLock.writeLock (no appendLock), so no cycle is introduced. This should be verified against any other lock pair in the system that involves either of these two locks, since a lock-ordering regression here would be silent until a deadlock occurs under load.


Phase 0 retry: fragile intermediate variable

int phase0Retrieved = -1;
for (int attempt = 0; attempt < 3; attempt++) {
    ...
    phase0Retrieved = pageCount;
    break;
    ...
}
snapshotDataPageCount = phase0Retrieved;

Control flow guarantees phase0Retrieved != -1 when snapshotDataPageCount = phase0Retrieved is reached (every path that doesn't break either returns or throws). But it is not enforced by construction - a future edit could silently introduce a path that sets snapshotDataPageCount = -1, causing subtle corruption in Phase 4. An assert phase0Retrieved >= 0 guard after the loop (inside the try block) would make the invariant explicit at zero production cost.


Inconsistent retry conventions

Phase 0 counts up (attempt < 3, last-attempt check: attempt == 2), while appendSamples counts down (attempt > 0, last-attempt check: attempt == 1). Not a bug, but mixing conventions in adjacent code increases cognitive load. Standardizing on one form would help future readers.


No logging on CME retry

When appendSamples catches a ConcurrentModificationException and loops, it retries silently. In production, a persistent CME storm (e.g., compaction firing more frequently than expected) would exhaust all 3 retries and throw an IOException with no prior diagnostic context in the log. A FINE/DEBUG-level log on each CME retry would make this much easier to diagnose:

LogManager.instance().log(this, Level.FINE,
    "CME on TS append (attempt %d/3) - retrying after Phase 4c race", 4 - attempt);

Test hooks in production code

public static volatile Runnable TEST_PRE_PHASE4C_HOOK and public static volatile AtomicInteger TEST_WAL_GAP_COUNTER appear in production classes. The pattern is apparently established in this codebase. Two small points:

  1. The volatile read is correctly snapshotted before the null check (final Runnable prePhase4cHook = TEST_PRE_PHASE4C_HOOK). Good.
  2. TEST_WAL_GAP_COUNTER starts as null and is promoted to a non-null AtomicInteger by the test - the null guard in ArcadeStateMachine is correct. A comment on the field noting that tests must call cleanupHooks() would prevent silent no-op failures in future tests that forget @AfterEach cleanup.

Standalone test does not exercise the fix

Issue4458AppendCompactionRaceTest is tagged @slow and documents in its own Javadoc that "the HA-only CME-retry path is NOT exercised here." This means the actual behavioral change (early read lock release + CME retry) is only covered by the HA integration test (Issue4458TsWalVersionGapIT). The standalone test is a useful regression guard for the unchanged path but provides no coverage of the new HA path. This is acceptable if the HA test is reliable, but worth flagging for reviewers: if CI skips @slow tests by default, the fix has no automated coverage in the standard build.


Docs file belongs in commit history, not source

docs/4458-ts-raft-wal-page0-version-gap.md is a PR-specific incident postmortem. It duplicates the PR body and will rot as the code evolves. This content belongs in the PR description (already present) or the commit message - not as a permanent source file that future readers must evaluate for currency.


Positive Notes

  • The deadlock analysis is correct and clearly explained. The three-way cycle (append holds readLock -> waits for session to end -> Phase 4c needs writeLock -> blocked by readLock) is exactly right.
  • The CME retry approach is sound: accept the conflict, roll back, re-read the freshly-cleared page. The retry limit of 3 is conservative and correct.
  • Test cleanup via @AfterEach properly nulls the hooks, preventing cross-test interference.
  • The appendLock now wraps the entire retry loop rather than just the inner body - the right scope to ensure only one append runs at a time across all retry attempts.
  • waitForAllServers() after schema creation in the HA test ensures the follower is synchronized before the race begins.
  • The isReplicated() guard correctly limits the early-release behavior to HA leaders only; standalone databases retain the original locking (read lock through commit) unchanged.

Summary

The core fix is correct and the HA integration test covers the critical scenario well. The main actionable items are: (1) clarify or remove the RaftReplicatedDatabase.java reference from the PR description/docs, (2) add a log line on CME retry for production diagnosability, (3) consider removing the docs file from source, and (4) add an assertion guard after the Phase 0 loop to protect the phase0Retrieved intermediate variable from future regressions.

…reset docs

- Add FINE-level log on each CME retry in appendSamples for production diagnosability
- Standardize Phase 0 retry loop to count-down convention, matching appendSamples
- Add assert guard after Phase 0 loop to make the phase0Retrieved invariant explicit
- Document on both TEST_* hook fields that tests must reset them in @AfterEach
@claude

claude Bot commented Jun 2, 2026

Copy link
Copy Markdown

Code Review - PR #4458 Fix: TS Compaction/Append Deadlock

The root-cause analysis and fix are sound. The deadlock chain (append holds readLock → waitForActiveRecordingSession spins → Phase 4c needs writeLock → blocked by readLock) is clearly explained and the solution is well-structured.

Potential Issues

1. Documentation mentions RaftReplicatedDatabase.TEST_PRE_REPLICATE_SCHEMA_HOOK but it's not in the diff

docs/4458-ts-raft-wal-page0-version-gap.md lists:

ha-raft/.../RaftReplicatedDatabase.java - TEST_PRE_REPLICATE_SCHEMA_HOOK added for deterministic testing

No such change appears in the diff, and the HA test (Issue4458TsWalVersionGapIT) doesn't use it either. Either this was planned and dropped, or it's a stale copy-paste. The doc should be corrected to avoid misleading future readers.

2. assert for a production invariant in TimeSeriesShard.java

assert phase0Retrieved >= 0 : "Phase 0 exited the retry loop without a valid page count";
snapshotDataPageCount = phase0Retrieved;

Java assertions are disabled by default (no -ea JVM flag in the server startup scripts). If the loop somehow exits without setting phase0Retrieved (e.g. a future code change breaks the invariant), this silently assigns -1 to snapshotDataPageCount and the bug propagates. Consider replacing with an explicit guard:

if (phase0Retrieved < 0)
  throw new IllegalStateException("Phase 0 exited retry loop without a valid page count");

3. Unused committed counter in Issue4458AppendCompactionRaceTest

The AtomicInteger committed is incremented on each successful insert but is never read in any assertion - the test correctly validates the actual database count instead. The field can be removed.

Minor Points

4. phase0Retrieved variable name is slightly awkward since it doesn't describe what was retrieved. A name like snapshotPageCount would match the existing local variable naming in that method.

5. Double-space alignment in the HA test:

final List<Throwable> errors  = new ArrayList<>();
final ExecutorService exec    = Executors.newFixedThreadPool(appendThreads + 1);

The existing codebase doesn't use this alignment style (single space after the type). Minor but inconsistent with surrounding code.

6. HA_QUORUM_TIMEOUT escape hatch remains the fallback - The PR description says the timeout was the deadlock escape, and the fix removes the need for it. However, waitForActiveRecordingSession() still has the timeout. Is there now a scenario where an append reaches waitForActiveRecordingSession() (without holding the readLock) and the recording session legitimately never ends? A brief comment clarifying this residual timeout is still correct and intentional would help maintainers.

What's Done Well

  • The race condition analysis in the Javadoc of appendSamples() is excellent - it clearly documents why the early readLock release is safe on HA, and why standalone keeps the original behaviour.
  • Proper lock ordering is maintained: appendLock is always acquired before compactionLock.readLock(), eliminating any lock-order inversion risk.
  • The readLockHeld boolean idiom correctly prevents double-unlock in all code paths.
  • Both test classes properly reset static hooks in @AfterEach, preventing test pollution across the JVM.
  • @Tag("slow") is correctly applied to both test classes per project conventions.
  • The 3-attempt retry limit is conservative and well-justified by the "at most one in-flight commit" reasoning.
  • AssertJ assertions and final keywords are used throughout, consistent with project style.
  • The TEST_WAL_GAP_COUNTER pattern (read the reference once into a local before the null check) correctly handles concurrent null-setting without synchronization.

- Replace Phase 0 assert with an explicit IllegalStateException guard (assertions are
  disabled by default in production, so the assert would not protect the invariant)
- Rename phase0Retrieved -> capturedPageCount for clarity
- Document that waitForActiveRecordingSession's HA_QUORUM_TIMEOUT is an intentional safety
  net retained after the #4458 source fix, not dead code
- Correct docs/4458 to drop the stale RaftReplicatedDatabase/TEST_PRE_REPLICATE_SCHEMA_HOOK ref
- Remove unused 'committed' counter and AtomicInteger import from the embedded test
- Normalize double-space alignment to single-space in the HA test
@claude

claude Bot commented Jun 2, 2026

Copy link
Copy Markdown

Code Review: fix(#4458) - TS compaction/append deadlock causing WAL version gap on Raft followers

This is a well-reasoned fix for a real production bug. The root-cause analysis is correct and the solution is sound. Notes below, roughly in order of importance.


Correctness / Thread Safety

appendSamples retry loop - lock-state tracking looks correct.
The readLockHeld = true initializer at the top of each iteration plus the finally { if (readLockHeld) unlock(); } guard is the right pattern. On a CME retry, the previous iteration already set readLockHeld = false before the failed commit, and the finally therefore skips the redundant unlock. On the next pass, the lock is re-acquired and the flag is reset to true. No double-unlock risk.

Phase 4c hook read into a local - good.

final Runnable prePhase4cHook = TEST_PRE_PHASE4C_HOOK;
if (prePhase4cHook != null)
  prePhase4cHook.run();

Reading the volatile into a local before the null-check eliminates the TOCTOU race where a test could null the field between the check and the call. Correct pattern.

db.begin() failure bypasses the retry loop.
If db.begin() throws, the exception propagates out of the try block, through the finally (which releases the read lock), and skips the CME catch. This means begin-failures are not retried. That is acceptable - begin failures are not the scenario this retry is designed for - but worth noting explicitly.


Potential Issues

appendsBusy latch counts down before transactions start (standalone test).
In Issue4458AppendCompactionRaceTest, append threads decrement appendsBusy immediately after hookFired.await() returns, before they call database.transaction(). The hook then sleeps 50 ms and lets Phase 4c proceed. Those 50 ms are the only window between "threads started" and "threads have opened a DB transaction." Under a slow CI runner, threads may not yet be inside db.begin() when Phase 4c runs, reducing the concurrency window the test is meant to exercise. Consider using a second CountDownLatch that is counted down from inside each append transaction (e.g., after mutableBucket.appendSamples) to guarantee actual in-flight transactions before Phase 4c acquires the write lock.

(In standalone mode the CME-retry path is not taken anyway, so this does not affect correctness of the test - but it does affect whether the race is reliably triggered.)

capturedPageCount < 0 guard is unreachable by static analysis.
Every loop path either breaks (sets the value), returns from the method, or throws. The comment explains the choice of throw over assert for production visibility, which is a reasonable call. Just flagging it in case the build toolchain's dead-code checker emits a warning.


Minor / Style

Comment verbosity vs. project conventions.
CLAUDE.md asks for comments only when the WHY is non-obvious. The locking semantics here are genuinely non-obvious, so in-code explanation is warranted. That said, some of the longer comment blocks (e.g., the Phase 0 retry paragraph) duplicate what is already in the docs/ file and could be condensed to a pointer.

Copyright header inconsistency.
Issue4458AppendCompactionRaceTest.java has Copyright © 2021-present ... (with the © symbol), while Issue4458TsWalVersionGapIT.java has Copyright 2021-present ... (without it). Existing files in the repo include the symbol; the IT test header should match.

docs/4458-ts-raft-wal-page0-version-gap.md placement.
Issue-specific design notes are typically better captured in commit messages or PR descriptions rather than committed to docs/. Once the fix ships, future readers searching for "TimeSeries compaction" will not naturally land on a file named after an issue number. Consider whether this belongs in docs/ or can be dropped in favour of the PR description.

Log level for CME retries.
Level.FINE is correct for expected-transient events. Good choice - won't alarm operators in production.


Test Coverage

Both scenarios are covered:

  • Issue4458AppendCompactionRaceTest - standalone path, verifies no data loss under concurrent append + compaction.
  • Issue4458TsWalVersionGapIT - 2-node Raft path, verifies TEST_WAL_GAP_COUNTER == 0 after the race; also checks follower convergence within 30 s.

The @AfterEach cleanup of both static hooks is present and correct in both test classes.


Summary

The fix is correct and the approach (release read lock early on replicated databases, retry on CME) is the right solution given the constraint that waitForActiveRecordingSession() must be called without holding the compaction read lock. The main actionable item before merge is the appendsBusy latch timing concern in the standalone test; the copyright nit and doc placement are low-priority. Everything else is informational.

…n embedded test

Count the appendsInFlight latch down from INSIDE the first append transaction (after the
write, before commit) instead of right after the thread starts. This guarantees the append
threads have an open transaction in flight when Phase 4c proceeds, making the
append/compaction overlap reliable regardless of CI runner scheduling latency.
@claude

claude Bot commented Jun 2, 2026

Copy link
Copy Markdown

Code Review - fix(#4458): TS compaction/append deadlock causing WAL version gap on Raft followers

The root cause analysis is precise and the fix is surgically targeted. Overall this is solid work. A few observations below.


Correctness

Lock ordering reversal (important to verify)

Before this fix, the acquisition order in appendSamples() was:

  1. compactionLock.readLock()
  2. appendLock

After the fix it is:

  1. appendLock
  2. compactionLock.readLock()

Lock-ordering inversions between two threads are the classic deadlock recipe. Since appendLock is private and not shared with any compaction code path, this looks safe - but it is worth adding an explicit comment near the field declarations (or in the class-level Javadoc) stating the canonical acquisition order, e.g.:

// Lock ordering: appendLock MUST be acquired before compactionLock on any code path
// that takes both. Violating this order will cause a deadlock.

A quick grep for any other site that acquires compactionLock first and then appendLock (even indirectly) would further confirm safety.


Retry-attempt log message arithmetic

LogManager.instance().log(this, Level.FINE,
    "CME on TimeSeries append for shard %d (attempt %d/3) - retrying after compaction Phase 4c race",
    shardIndex, 4 - attempt);

When attempt = 3 the log says "attempt 1/3", and when attempt = 2 it says "attempt 2/3" - the arithmetic is correct but the countdown convention makes it non-obvious. Using an ascending counter (attemptNumber = 4 - attempt) as a named variable would make it clearer at a glance, or simply use an ascending loop variable.


Phase 0 capturedPageCount sentinel pattern

Using a capturedPageCount = -1 sentinel and a trailing guard:

if (capturedPageCount < 0)
    throw new IllegalStateException("Phase 0 exited the retry loop without a valid page count");

...works, but it is slightly fragile - any future refactor that adds a continue without setting capturedPageCount will silently propagate the -1. An Optional<Integer> or simply initializing snapshotDataPageCount to 0 with an explicit early-return on pageCount == 0 would make the invariant structurally impossible to violate. That said, the defensive guard is appreciated and the comment explains it well.


Testing

appendsInFlight latch timing in Issue4458AppendCompactionRaceTest

The latch counts down from inside the transaction lambda (before the implicit commit). In standalone mode the read lock IS held at that point, which is exactly the intended setup - the hook can confirm all readers hold their locks before sleeping. This is clever and correctly documented.

One edge case: if a thread fails on its very first insert (e.g., schema error), appendsInFlight may never reach zero, causing the hook to block for the full 10-second await timeout before proceeding. The test would still pass or fail correctly afterward; it just adds latency. Given that schema creation happens before the thread pool, this is unlikely in practice.

HA test data-integrity check

Issue4458TsWalVersionGapIT verifies followerCount == leaderCount but does not check that the counts equal initialInserts + appendThreads * appendsPerThread. If the leader itself lost data (due to retry exhaustion), both nodes would agree on the lower count and the assertion would pass. This is acceptable for a WAL-gap regression test, but adding a minimum-count lower bound would make it a stronger data-integrity guard:

assertThat(followerCount)
    .as("follower converges to leader count").isEqualTo(leaderCount);
assertThat(leaderCount)
    .as("no samples lost on leader")
    .isGreaterThanOrEqualTo(250 + appendThreads * appendsPerThread);

Minor style notes

  • docs/4458-ts-raft-wal-page0-version-gap.md: Per project guidelines, markdown docs are not added unless explicitly requested. The design explanation is valuable but lives more naturally in the PR description (which already contains it verbatim). Consider removing the file from the commit.

  • TEST hook fields (TEST_PRE_PHASE4C_HOOK, TEST_WAL_GAP_COUNTER) as public static volatile are pragmatic for concurrency testing, and the @AfterEach cleanup requirement is clearly documented. Conventionally these are annotated @VisibleForTesting or placed in a test-support class; not a blocker, just a note.

  • RaftReplicatedDatabase.java Javadoc addition - the note explaining why waitForActiveRecordingSession() is not dead code is useful and well-placed.


Summary

The fix is correct and the tests are well-designed to exercise the race deterministically. The main item worth addressing before merge is adding explicit lock-ordering documentation to prevent a future regression; the others are minor polish suggestions.

robfrank added 2 commits June 2, 2026 14:41
…a-integrity assert

- Document the canonical lock-ordering invariant (appendLock before compactionLock) at the
  field declarations to prevent a future lock-inversion regression
- Name the ascending retry counter in the CME log message instead of inline 4 - attempt
- HA test: assert the leader retains the full expected sample count (no leader-side loss),
  not merely that the follower agrees with the leader
@claude

claude Bot commented Jun 2, 2026

Copy link
Copy Markdown

Code Review - fix(#4458): TS compaction/append deadlock causing WAL version gap on Raft followers

Summary: Well-documented fix for a real concurrency bug. The root cause analysis is accurate, the fix is targeted, and the test coverage is solid. A few observations below.


Correctness - Looks good

The deadlock chain is correctly identified and the fix breaks it at the right point:

  • Lock ordering is now documented and enforced: appendLock first, then compactionLock (previously the nesting was reversed, which is the pattern that enables deadlocks).
  • Standalone vs. HA paths are cleanly separated via db.isReplicated(). Standalone keeps the old lock-through-commit behavior (no recording session to deadlock with), HA releases before commit.
  • MVCC retry logic is sound: a ConcurrentModificationException after releasing the read lock means Phase 4c won already, so retrying on the freshly cleared page is safe.
  • The IllegalStateException guard for capturedPageCount < 0 replacing a disabled-by-default assert is the right call.

Potential issues

1. The embedded test does not exercise the new code path

Issue4458AppendCompactionRaceTest explicitly states in its Javadoc that the db.isReplicated() early-release branch is never taken (standalone mode always holds the lock through commit). The CME-retry loop - the most critical new code - is only exercised by the HA integration test.

This means the embedded test is a regression guard for the old standalone path, not the new fix. That's worth a comment in the test name or a method-level note so future maintainers don't assume it covers the HA path. The HA IT is the real regression guard here.

2. readLockHeld boolean is fragile

compactionLock.readLock().lock();
boolean readLockHeld = true;
try {
    ...
    if (db.isReplicated()) {
        compactionLock.readLock().unlock();
        readLockHeld = false;
    }
    try {
        db.commit();
        return;
    } catch (final ConcurrentModificationException e) { ... }
} finally {
    if (readLockHeld)
        compactionLock.readLock().unlock();
}

If someone adds code between the readLockHeld = false assignment and the commit try-block, they could accidentally suppress the unlock. A small refactor that might be more readable: move the early-release into its own helper or document the pattern with an inline comment. Not blocking, but worth noting for future maintainability.

3. Magic retry count 3

Both appendSamples() and Phase 0 retry loops use the literal 3. The comment says "one retry is sufficient in practice; the loop uses 3 for safety". A named constant would make the intent clearer and ensure consistency if either value is changed independently later:

private static final int COMPACTION_RACE_RETRY_LIMIT = 3;

4. Test hooks in production code

TEST_PRE_PHASE4C_HOOK and TEST_WAL_GAP_COUNTER are public static volatile fields in production classes. The PR description acknowledges this matches the existing TEST_POST_REPLICATION_HOOK pattern and defers it. Acceptable given the project pattern, but the null-check in every compaction Phase 4c execution is worth noting. The performance impact is negligible (single volatile read), but the pattern does add test surface to the production class.


Minor observations

  • attemptNumber = 4 - attempt is clever but non-obvious. The comment // ascending 1..3 for human-readable logging helps, but an incremented counter would be more readable:

    for (int attempt = 0; attempt < 3; attempt++) {
        // attempt is 0-based; log as attempt+1
    }

    Not a bug, just a readability nit.

  • Thread.sleep(50) in the embedded test hook - This fixed-sleep is noted as addressed in cycle 4 (signals from inside the transaction). It is still a timing assumption; if CI is pathologically slow the 50ms window may not be enough. The appendsInFlight latch mitigates most of the risk by ensuring threads have open transactions before the sleep, which is the right design.

  • Docs file - docs/4458-ts-raft-wal-page0-version-gap.md contains review cycle history and a PR link. These are useful as an artifact but will go stale. The deferred rationale ("established repo convention") is reasonable.


Test coverage assessment

Scenario Covered by
Standalone: concurrent appends + compaction, no data loss Issue4458AppendCompactionRaceTest
HA: Phase 4c race causes WAL gap on follower Issue4458TsWalVersionGapIT
HA: leader data integrity after race Issue4458TsWalVersionGapIT
HA: follower convergence after race Issue4458TsWalVersionGapIT
HA: CME-retry path in appendSamples() Issue4458TsWalVersionGapIT (implicitly, if gap was eliminated)
Phase 0 CME-retry path Not directly tested (race window is very narrow)

The Phase 0 CME-retry path (the new loop in compactInternal() Phase 0) is not directly exercised by either test. Given the very tight window this race requires, this is a pragmatic gap - but it is new production code. Consider adding a comment noting it is defensive and expected to be triggered rarely.


Overall

Solid fix for a subtle HA concurrency bug. The lock ordering change and the replicated-vs-standalone branching are correct. The main actionable suggestions are: (1) clarify in the embedded test that it does not cover the new HA code path, (2) extract the retry limit as a named constant. No blocking issues.

@claude

claude Bot commented Jun 2, 2026

Copy link
Copy Markdown

Code Review - TimeSeriesShard deadlock fix

The fix correctly targets the deadlock. Root cause is accurate: read lock held across commit() which calls waitForActiveRecordingSession(), creating a cycle with Phase 4c write lock.

CORRECTNESS - no blocking issues:

  • Lock ordering is now correctly inverted: appendLock (outer) then compactionLock.readLock() (inner). Old nesting was reversed.
  • Standalone vs. HA branching via db.isReplicated() is clean. Standalone keeps old semantics, HA takes the early-release path.
  • CME retry on both appendSamples() and Phase 0 is bounded (3 attempts) and throws IOException on exhaustion.
  • IllegalStateException guard for capturedPageCount < 0 is better than a disabled-in-production assert.
  • TEST_PRE_PHASE4C_HOOK local variable capture before run() prevents a TOCTOU race if a test resets the hook between the null check and the call.

OBSERVATIONS:

  1. HA test synchronization is timing-based; embedded test is latch-based.
    The embedded test uses appendsInFlight.await() to guarantee threads have open transactions before Phase 4c proceeds. The HA test uses Thread.sleep(200) instead. On a heavily loaded CI runner this may not be enough for 4 threads to reach waitForActiveRecordingSession(). The test would still pass (no gap even if the race misses), but it would test vacuously. Consider counting down a latch from inside each thread first INSERT, as the embedded test does.

  2. Phase 0 CME-retry path has no direct test.
    The retry window (in-flight append commit racing between Phase 0 begin and commit while write lock is held) is very narrow. A brief comment noting this path is defensive and expected to trigger rarely would help future maintainers understand why there is no direct test.

  3. readLockHeld boolean.
    The pattern works, but future refactoring that inserts code between readLockHeld=false and the commit try-block could break the unlock invariant. A comment explaining the flag purpose (avoids double-unlock when early-release already happened) would help.

  4. Retry count literal.
    Both loops hard-code 3. A named constant (COMPACTION_RACE_MAX_RETRIES) would make intent explicit and keep the two sites in sync.

MINOR NITS:

  • attemptNumber = 4 - attempt is correct (1,2,3 ascending). A forward-counting loop would read more naturally.
  • Docs file contains review-cycle history and PR URL that will go stale - acceptable per stated repo convention.

TEST COVERAGE:
Standalone concurrent append+compaction data integrity: Issue4458AppendCompactionRaceTest
HA no WAL gap on follower during Phase 4c race: Issue4458TsWalVersionGapIT
HA leader data integrity + follower convergence: Issue4458TsWalVersionGapIT
CME-retry in appendSamples() HA path: covered implicitly by the HA IT
Phase 0 CME-retry in compactInternal(): not directly tested

SUMMARY: No blocking issues. The core fix is correct. Main suggestions: (1) latch-based synchronization in the HA test to avoid vacuous pass on slow CI, and (2) a comment on the Phase 0 retry path being defensive and untested-by-design.

@codecov

codecov Bot commented Jun 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 53.52113% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.28%. Comparing base (8bd740d) to head (a31020d).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...om/arcadedb/engine/timeseries/TimeSeriesShard.java 55.22% 20 Missing and 10 partials ⚠️
...om/arcadedb/server/ha/raft/ArcadeStateMachine.java 25.00% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##               main    #4460     +/-   ##
===========================================
  Coverage     65.27%   65.28%             
- Complexity        0      594    +594     
===========================================
  Files          1612     1649     +37     
  Lines        124323   128388   +4065     
  Branches      26912    27540    +628     
===========================================
+ Hits          81150    83812   +2662     
- Misses        31780    32743    +963     
- Partials      11393    11833    +440     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@robfrank robfrank merged commit 37682c5 into main Jun 2, 2026
26 of 30 checks passed
@robfrank robfrank deleted the fix/4458-ts-raft-wal-page0-version-gap branch June 2, 2026 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky WAL page-version-gap on TimeSeries shard page-0 under concurrent inserts + background compaction in Raft HA

1 participant