Skip to content

fix(#4331): abort crash recovery on WAL version gap and preserve WAL files#4356

Merged
robfrank merged 3 commits into
mainfrom
fix/4331-wal-version-gap-recovery
May 27, 2026
Merged

fix(#4331): abort crash recovery on WAL version gap and preserve WAL files#4356
robfrank merged 3 commits into
mainfrom
fix/4331-wal-version-gap-recovery

Conversation

@robfrank

Copy link
Copy Markdown
Collaborator

Closes #4331

checkIntegrity() passed ignoreErrors=true to applyChanges(). When a WAL page version was more than one ahead of the on-disk version (a version gap), the code logged a WARNING and called continue, skipping that page but continuing to apply sibling pages in the same multi-page transaction. The result was a torn transaction on disk. All WAL files were then dropped unconditionally, making the corruption permanent and undiagnosable.

The fix passes ignoreErrors=false so WALVersionGapException is thrown immediately on the first gap, stopping the entire transaction replay before any sibling pages are written. The exception is caught in checkIntegrity(): WAL files are closed without deletion (preserved for manual inspection at the database path), a SEVERE log is emitted, and a new WAL pool is created so the database remains writable after partial recovery.

Test plan

  • WALVersionGapRecoveryTest.applyChangesThrowsOnVersionGap: confirms WALVersionGapException is raised when the WAL page version skips ahead
  • WALVersionGapRecoveryTest.versionGapPageNotApplied: confirms the gapped page is not written to disk (no torn transaction)
  • WALVersionGapRecoveryTest.checkIntegrityPreservesWALOnVersionGap: integration test - kills the database, injects a crafted WAL file with a version gap, reopens (triggering checkIntegrity()), and verifies WAL files are still present on disk rather than dropped
  • All existing WAL and transaction tests continue to pass (TransactionManagerCloseWALFsyncTest, ApplyChangesPartialReplayTest, WALFileGetTransactionTest, 65 tests total)

@codacy-production

codacy-production Bot commented May 26, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 7 complexity

Metric Results
Complexity 7

View in Codacy

🟢 Coverage 79.31% diff coverage · -7.84% coverage variation

Metric Results
Coverage variation -7.84% coverage variation
Diff coverage 79.31% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (61177a8) 127719 94366 73.89%
Head commit (a58ea0d) 159417 (+31698) 105294 (+10928) 66.05% (-7.84%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#4356) 29 23 79.31%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses issue #4331 by preventing the silent skipping of WAL version gaps during crash recovery. It modifies TransactionManager.checkIntegrity() to pass ignoreErrors=false to applyChanges(), catches the resulting WALVersionGapException, halts the recovery loop, and preserves the WAL files for manual inspection rather than dropping them. The feedback highlights critical issues in the implementation: updating lastTxId before applyChanges succeeds can lead to incorrect transaction ID increments or duplicate IDs if the first transaction fails; resetting transactionIds when lastTxId remains -1 can overwrite the valid persisted ID; and iterating over activeWALFilePool without null checks risks throwing a NullPointerException if any WAL file failed to initialize.

Comment on lines +274 to +284
lastTxId = lowerTxId;

applyChanges(walPositions[lowerTx], Collections.emptyMap(), true);
try {
applyChanges(walPositions[lowerTx], Collections.emptyMap(), false);
} catch (WALVersionGapException e) {
LogManager.instance().log(this, Level.SEVERE,
"Recovery aborted for database '%s': version gap in WAL (txId=%d). WAL files have been preserved for manual inspection. No further transactions will be replayed.",
null, database, lowerTxId);
walGapDetected = true;
break;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Updating lastTxId before applyChanges succeeds means that if a WALVersionGapException is thrown, lastTxId will be set to the failed transaction ID. This causes the next transaction ID to be incorrectly incremented past the failed transaction, leaving a gap. Additionally, if the very first transaction fails, lastTxId remains -1, which would reset transactionIds to 0 and overwrite the previously loaded persistedLastTxId, leading to duplicate transaction IDs and severe database corruption. We should only update lastTxId after applyChanges completes successfully.

          try {
            applyChanges(walPositions[lowerTx], Collections.emptyMap(), false);
            lastTxId = lowerTxId;
          } catch (WALVersionGapException e) {
            LogManager.instance().log(this, Level.SEVERE,
                "Recovery aborted for database '%s': version gap in WAL (txId=%d). WAL files have been preserved for manual inspection. No further transactions will be replayed.",
                null, database, lowerTxId);
            walGapDetected = true;
            break;
          }

Comment on lines 289 to 290
// CONTINUE FROM LAST TXID
transactionIds.set(lastTxId + 1);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

If no transactions were successfully applied from the WAL (i.e., lastTxId remains -1), we must not overwrite transactionIds with lastTxId + 1 (which would be 0). Doing so would overwrite the valid persistedLastTxId loaded during initialization, causing transaction ID reuse and corruption. Only update transactionIds if at least one transaction was successfully replayed.

        if (lastTxId != -1) {
          // CONTINUE FROM LAST TXID
          transactionIds.set(lastTxId + 1);
        }

Comment on lines +292 to 314
if (!walGapDetected) {
// REMOVE ALL WAL FILES
for (final WALFile file : activeWALFilePool) {
try {
file.drop();
LogManager.instance().log(this, Level.FINE, "Dropped WAL file '%s'", null, file);
} catch (final IOException e) {
LogManager.instance().log(this, Level.SEVERE, "Error on dropping WAL file '%s'", e, file);
}
}
} else {
// Close WAL files without deleting: preserve for manual inspection after gap detection
for (final WALFile file : activeWALFilePool) {
try {
file.close();
} catch (final IOException e) {
LogManager.instance().log(this, Level.WARNING, "Error on closing WAL file '%s'", e, file);
}
}
LogManager.instance().log(this, Level.SEVERE,
"WAL files for database '%s' have been preserved in '%s' for manual inspection.",
null, database, database.getDatabasePath());
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If any WAL file failed to initialize (e.g., due to a FileNotFoundException), its entry in activeWALFilePool will be null. Iterating over the pool and calling file.drop() or file.close() without a null check will throw a NullPointerException, abruptly aborting the recovery cleanup. Adding a null check ensures robust cleanup.

        if (!walGapDetected) {
          // REMOVE ALL WAL FILES
          for (final WALFile file : activeWALFilePool) {
            if (file != null) {
              try {
                file.drop();
                LogManager.instance().log(this, Level.FINE, "Dropped WAL file '%s'", null, file);
              } catch (final IOException e) {
                LogManager.instance().log(this, Level.SEVERE, "Error on dropping WAL file '%s'", e, file);
              }
            }
          }
        } else {
          // Close WAL files without deleting: preserve for manual inspection after gap detection
          for (final WALFile file : activeWALFilePool) {
            if (file != null) {
              try {
                file.close();
              } catch (final IOException e) {
                LogManager.instance().log(this, Level.WARNING, "Error on closing WAL file '%s'", e, file);
              }
            }
          }
          LogManager.instance().log(this, Level.SEVERE,
              "WAL files for database '%s' have been preserved in '%s' for manual inspection.",
              null, database, database.getDatabasePath());
        }

@claude

claude Bot commented May 26, 2026

Copy link
Copy Markdown

Review posted via file

@claude

claude Bot commented May 26, 2026

Copy link
Copy Markdown

Code Review

Overview

This PR fixes a real data-corruption bug in WAL crash recovery (fixes #4331). The root cause is well-diagnosed: checkIntegrity() passed ignoreErrors=true to applyChanges(), which caused a version gap to silently skip a single page while continuing to apply sibling pages in the same transaction - a torn transaction. All WAL files were then unconditionally dropped, making the corruption unrecoverable. The fix is conceptually correct.

Positive Aspects

  • The core fix (switching to ignoreErrors=false + catching WALVersionGapException) is minimal and targeted.
  • Preserving WAL files instead of dropping them on a gap is the right operational behavior.
  • The lastTxId != -1 guard for transactionIds.set() silently fixes a second bug: the old code would call set(0) when no transactions were replayed (e.g., when the first WAL transaction itself has a gap), resetting the counter to 1 and risking ID reuse. Good catch.
  • The null checks added to both WAL file iteration loops are correct - activeWALFilePool[i] is legitimately nullable (set to null when moved to the inactive pool).

Issues

  1. Duplicate SEVERE log messages

When a gap is detected, two separate SEVERE messages are emitted: one inside the catch block (has txId) and one after the close loop (has path). Consolidate into one message to avoid operator confusion - include both the txId and the path in the catch block message, and remove the second SEVERE log after the close loop.

  1. Test: database.close() after db.kill() is ambiguous

In checkIntegrityPreservesWALOnVersionGap, both db and database reference the same object. After db.kill(), calling database.close() on the already-killed instance may silently succeed or throw depending on internal state. Other TestHelper-based tests in the codebase use kill() OR close(), not both in sequence. Verify this does not cause resource leaks or spurious failures, and that the database = factory.open() reassignment at the end does not interfere with TestHelper.endTest() cleanup.

  1. Tests 1 and 2 are nearly identical

applyChangesThrowsOnVersionGap and versionGapPageNotApplied build the same gapped WAL page and both assert WALVersionGapException is thrown. The second only adds a page-version-unchanged assertion. The two tests could be merged into one. Keeping both as-is adds noise without meaningful additional coverage.

  1. writeGapWALFile uses hardcoded currentPageSize = 512

The injected gap WAL entry uses 512 for currentPageSize, which may not match the actual page size on disk. If applyChanges uses this field before reaching the version-gap check, the test could fail for a different reason or pass for the wrong reason. Consider using the actual page size already obtained earlier in the test (file.getPageSize()).

  1. Documentation file does not belong in the repo

docs/4331-wal-version-gap-recovery.md contains implementation notes already captured in the PR description. Per project conventions, analysis and fix-explanation docs are not committed to the repository. Please remove this file.

  1. Inline comments in TransactionManager.java are overly verbose

The multi-line block comments explaining the lastTxId != -1 guard and the transactionIds.set() call are longer than the code they annotate. One short line expressing the why (avoid ID reuse when no transactions were replayed) is sufficient.

Test Coverage Gap

The tests confirm WAL files are preserved after a gap but do not assert that records written before the gap are still readable after reopening the database. A post-reopen count or query assertion would confirm partial recovery is actually correct, not just that WAL files survive on disk.

Summary

The fix is sound and addresses a genuine corruption risk. Main items to address before merging:

  1. Remove docs/4331-wal-version-gap-recovery.md
  2. Consolidate the two duplicate SEVERE log messages into one (txId + path in the catch block)
  3. Verify the kill() + close() sequence in the integration test is safe with TestHelper lifecycle
  4. Use the real page size in writeGapWALFile instead of hardcoded 512

@claude

claude Bot commented May 26, 2026

Copy link
Copy Markdown

Code Review

The fix is well-targeted and the reasoning in the PR description is clear. The two transaction-ID bugs caught by the earlier Gemini review were correctly addressed. A few remaining issues:


Critical - NPE in walPositions initialization loop (incomplete null-check fix)

The Gemini review flagged missing null-checks on activeWALFilePool, and the response added null-checks to the drop and close loops. But the initialization loop right above them still dereferences without a guard:

// TransactionManager.java lines 247-251
final WALFile.WALTransaction[] walPositions = new WALFile.WALTransaction[activeWALFilePool.length];
for (int i = 0; i < activeWALFilePool.length; ++i) {
    final WALFile file = activeWALFilePool[i];
    walPositions[i] = file.getFirstTransaction();  // NPE if activeWALFilePool[i] is null
}

activeWALFilePool[i] is set to null when the constructor throws FileNotFoundException (lines 240-243), so file.getFirstTransaction() will NPE in that case. The fix needed here is the same pattern already used in the cleanup loops:

for (int i = 0; i < activeWALFilePool.length; ++i) {
    final WALFile file = activeWALFilePool[i];
    walPositions[i] = file != null ? file.getFirstTransaction() : null;
}

The main while loop already handles null entries via the walTx != null guard at line 262, so this one-liner fix is sufficient.


Minor - Two SEVERE messages are emitted for the same gap event

When a gap is detected, both the catch block (line 280) and the post-loop section (line 319) emit a SEVERE log. Operators will see two alarm-level lines referring to the same event. Fold the path info into the catch-block message and remove the second SEVERE:

} catch (WALVersionGapException e) {
    LogManager.instance().log(this, Level.SEVERE,
        "Recovery aborted for database '%s': version gap in WAL (txId=%d). WAL files preserved in '%s' for manual inspection. No further transactions will be replayed.",
        null, database, lowerTxId, database.getDatabasePath());
    walGapDetected = true;
    break;
}

Then the else-branch after the cleanup close loop can just log at FINE or be removed.


Minor - docs/4331-wal-version-gap-recovery.md should not be committed

Per project conventions, analysis and fix-explanation documents belong in the PR description, not in the repository tree. This file duplicates the PR description almost verbatim and adds permanent maintenance overhead. Please remove it.


Nit - Inline comments in TransactionManager.java are overly descriptive

The project style reserves comments for non-obvious WHY (hidden constraints, subtle invariants). Several of the new inline comments explain WHAT the code does rather than why - those should be dropped. The one explaining the lastTxId != -1 guard is borderline useful; the others (e.g. // Only advance lastTxId after a successful apply..., // Close WAL files without deleting:) describe behavior the code already expresses and add noise.


The core logic change is correct - switching to ignoreErrors=false, catching WALVersionGapException, preserving WAL files, and guarding transactionIds.set() are all the right moves. The NPE in the init loop is the only remaining correctness issue.

robfrank added 3 commits May 27, 2026 08:37
…files

checkIntegrity() passed ignoreErrors=true to applyChanges(), which on a
version gap logged a WARNING and continued to the next page in the same
transaction. Sibling pages were applied while the gapped page was skipped,
leaving a torn transaction on disk. Afterwards all WAL files were dropped,
making the inconsistency permanent and undiagnosable.

The fix passes ignoreErrors=false so WALVersionGapException is thrown on
the first gap, immediately aborting the entire transaction replay. The
exception is caught in checkIntegrity(): WAL files are closed without
deletion (preserved for manual inspection) and a SEVERE log is emitted.
A new WAL pool is always created so the database can continue writing after
partial recovery.
… reset when empty, null-check WAL files in cleanup
@robfrank robfrank force-pushed the fix/4331-wal-version-gap-recovery branch from 92d981b to a58ea0d Compare May 27, 2026 06:37
@claude

claude Bot commented May 27, 2026

Copy link
Copy Markdown

Code Review

Summary: This PR correctly fixes a real data-corruption bug: checkIntegrity() was calling applyChanges(ignoreErrors=true), which skipped gapped pages with continue but kept applying sibling pages of the same transaction, leaving torn transactions on disk. The core logic changes are sound.


Core logic - correct

  • Moving lastTxId = lowerTxId after a successful applyChanges (instead of before) is the right fix; a failed apply must not advance the transaction counter.
  • Guarding transactionIds.set(lastTxId + 1) with lastTxId != -1 is correct: without it, an all-gap recovery would reset the counter to 0 and risk transaction-id reuse against the persistedLastTxId loaded by the constructor.
  • Null-checks added in the WAL cleanup loops (if (file == null) continue;) address the FileNotFoundException path in createWALFilePool().

Issues

1. WAL preservation window is shorter than the log implies (important)

checkIntegrity() closes the gap WAL files without deleting them and logs a SEVERE that says they are "preserved for manual inspection." However, close() contains an OS-level sweep that deletes every .wal file in the database directory on a clean shutdown:

// TransactionManager.java close()
else {
    File[] walFiles = dir.listFiles((dir1, name) -> name.endsWith(".wal"));
    if (walFiles != null)
        Stream.of(walFiles).forEach(File::delete);   // deletes preserved gap WALs too

The preserved files are not in inactiveWALFilePool, so cleanWALFiles returns true and the sweep runs unconditionally. An operator who restarts the database service (clean shutdown + restart) will find no WAL files to inspect, contradicting the SEVERE message.

Consider either:

  • Appending to the log message: "WAL files will be removed on next clean shutdown - copy them now if manual inspection is needed."
  • Or renaming them to a non-.wal extension (e.g. .wal.bad) so they survive the sweep. This also avoids re-triggering recovery on a subsequent kill+open cycle if any new WAL is written between the gap and the clean close.

2. Pre-existing null-safety gap in the early close loop (minor, but the PR is adjacent)

Lines 226-234 iterate activeWALFilePool and call file.close() without a null check:

for (final WALFile file : activeWALFilePool) {
    try {
        file.close();   // NPE if createWALFilePool() left this slot null
    } catch (final IOException e) {
        // NullPointerException is not IOException - would escape
    }
}

The PR added null checks in the two cleanup loops lower in the same method but missed this one. Worth fixing in the same pass.

3. docs/4331-wal-version-gap-recovery.md should not be committed

Per CLAUDE.md: "NEVER create documentation files (*.md) or README files unless explicitly requested." The file also contains review-cycle notes (gemini-code-assist comments, loop timeouts) that belong in the PR description, not in the repository history. Please remove it.

4. Multi-line Javadoc in the test file

CLAUDE.md style: "don't add multi-line comment blocks - one short line max." The test class has a multi-paragraph Javadoc block, and writeGapWALFile has a Javadoc with a detailed format spec. Both should be reduced to a single short comment or removed.


Minor observations

  • applyChangesThrowsOnVersionGap and versionGapPageNotApplied are almost identical (same setup, same assertion, the second adds one extra page-version check). Merging them into a single test would reduce duplication without losing coverage.
  • The writeGapWALFile helper hard-codes the WAL binary layout. If the format changes, the injected file will be silently unreadable (treated as empty) rather than causing a gap, so the test will stop catching the regression it was written for. A short comment pointing to WALFile.writeTransactionToBuffer as the canonical format reference would help a future reader.
  • boolean walGapDetected = false; - explicit initialisation of a local variable is correct Java (locals have no default), no issue here.

Test coverage

The three new tests cover the key cases well:

  • applyChangesThrowsOnVersionGap - unit: exception is thrown
  • versionGapPageNotApplied - unit: on-disk version is unchanged
  • checkIntegrityPreservesWALOnVersionGap - integration: WAL files survive after reopen

One missing scenario: what happens on a second open after the gap (e.g. kill, gap injected, open, kill again, open again)? Given the OS-level sweep in close() this cycle terminates on the first clean close, but a test confirming that behaviour would add confidence.


Overall this is a solid fix for a genuine corruption path. Addressing the preservation-window documentation and removing the docs file are the two changes I would consider blocking.

@codecov

codecov Bot commented May 27, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 65.51724% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.37%. Comparing base (61177a8) to head (a58ea0d).

Files with missing lines Patch % Lines
...n/java/com/arcadedb/engine/TransactionManager.java 65.51% 6 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4356      +/-   ##
============================================
- Coverage     64.72%   64.37%   -0.35%     
- Complexity      446      457      +11     
============================================
  Files          1647     1647              
  Lines        127719   127739      +20     
  Branches      27375    27380       +5     
============================================
- Hits          82660    82227     -433     
- Misses        33384    33945     +561     
+ Partials      11675    11567     -108     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@robfrank robfrank merged commit 90414ff into main May 27, 2026
26 of 30 checks passed
@robfrank robfrank deleted the fix/4331-wal-version-gap-recovery branch May 27, 2026 07:47
@robfrank robfrank added this to the 26.6.1 milestone May 27, 2026
@robfrank robfrank self-assigned this May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TransactionManager.applyChanges(ignoreErrors=true) silently skips WAL version gaps on recovery

1 participant