fix(#4331): abort crash recovery on WAL version gap and preserve WAL files#4356
Conversation
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 7 |
🟢 Coverage 79.31% diff coverage · -7.84% coverage variation
Metric Results Coverage variation ✅ -7.84% coverage variation Diff coverage ✅ 79.31% diff coverage Coverage variation details
Coverable lines Covered lines Coverage Common ancestor commit (61177a8) 127719 94366 73.89% Head commit (a58ea0d) 159417 (+31698) 105294 (+10928) 66.05% (-7.84%) Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch:
<coverage of head commit> - <coverage of common ancestor commit>Diff coverage details
Coverable lines Covered lines Diff coverage Pull request (#4356) 29 23 79.31% Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified:
<covered lines added or modified>/<coverable lines added or modified> * 100%
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
This pull request addresses issue #4331 by preventing the silent skipping of WAL version gaps during crash recovery. It modifies TransactionManager.checkIntegrity() to pass ignoreErrors=false to applyChanges(), catches the resulting WALVersionGapException, halts the recovery loop, and preserves the WAL files for manual inspection rather than dropping them. The feedback highlights critical issues in the implementation: updating lastTxId before applyChanges succeeds can lead to incorrect transaction ID increments or duplicate IDs if the first transaction fails; resetting transactionIds when lastTxId remains -1 can overwrite the valid persisted ID; and iterating over activeWALFilePool without null checks risks throwing a NullPointerException if any WAL file failed to initialize.
| lastTxId = lowerTxId; | ||
|
|
||
| applyChanges(walPositions[lowerTx], Collections.emptyMap(), true); | ||
| try { | ||
| applyChanges(walPositions[lowerTx], Collections.emptyMap(), false); | ||
| } catch (WALVersionGapException e) { | ||
| LogManager.instance().log(this, Level.SEVERE, | ||
| "Recovery aborted for database '%s': version gap in WAL (txId=%d). WAL files have been preserved for manual inspection. No further transactions will be replayed.", | ||
| null, database, lowerTxId); | ||
| walGapDetected = true; | ||
| break; | ||
| } |
There was a problem hiding this comment.
Updating lastTxId before applyChanges succeeds means that if a WALVersionGapException is thrown, lastTxId will be set to the failed transaction ID. This causes the next transaction ID to be incorrectly incremented past the failed transaction, leaving a gap. Additionally, if the very first transaction fails, lastTxId remains -1, which would reset transactionIds to 0 and overwrite the previously loaded persistedLastTxId, leading to duplicate transaction IDs and severe database corruption. We should only update lastTxId after applyChanges completes successfully.
try {
applyChanges(walPositions[lowerTx], Collections.emptyMap(), false);
lastTxId = lowerTxId;
} catch (WALVersionGapException e) {
LogManager.instance().log(this, Level.SEVERE,
"Recovery aborted for database '%s': version gap in WAL (txId=%d). WAL files have been preserved for manual inspection. No further transactions will be replayed.",
null, database, lowerTxId);
walGapDetected = true;
break;
}| // CONTINUE FROM LAST TXID | ||
| transactionIds.set(lastTxId + 1); |
There was a problem hiding this comment.
If no transactions were successfully applied from the WAL (i.e., lastTxId remains -1), we must not overwrite transactionIds with lastTxId + 1 (which would be 0). Doing so would overwrite the valid persistedLastTxId loaded during initialization, causing transaction ID reuse and corruption. Only update transactionIds if at least one transaction was successfully replayed.
if (lastTxId != -1) {
// CONTINUE FROM LAST TXID
transactionIds.set(lastTxId + 1);
}| if (!walGapDetected) { | ||
| // REMOVE ALL WAL FILES | ||
| for (final WALFile file : activeWALFilePool) { | ||
| try { | ||
| file.drop(); | ||
| LogManager.instance().log(this, Level.FINE, "Dropped WAL file '%s'", null, file); | ||
| } catch (final IOException e) { | ||
| LogManager.instance().log(this, Level.SEVERE, "Error on dropping WAL file '%s'", e, file); | ||
| } | ||
| } | ||
| } else { | ||
| // Close WAL files without deleting: preserve for manual inspection after gap detection | ||
| for (final WALFile file : activeWALFilePool) { | ||
| try { | ||
| file.close(); | ||
| } catch (final IOException e) { | ||
| LogManager.instance().log(this, Level.WARNING, "Error on closing WAL file '%s'", e, file); | ||
| } | ||
| } | ||
| LogManager.instance().log(this, Level.SEVERE, | ||
| "WAL files for database '%s' have been preserved in '%s' for manual inspection.", | ||
| null, database, database.getDatabasePath()); | ||
| } |
There was a problem hiding this comment.
If any WAL file failed to initialize (e.g., due to a FileNotFoundException), its entry in activeWALFilePool will be null. Iterating over the pool and calling file.drop() or file.close() without a null check will throw a NullPointerException, abruptly aborting the recovery cleanup. Adding a null check ensures robust cleanup.
if (!walGapDetected) {
// REMOVE ALL WAL FILES
for (final WALFile file : activeWALFilePool) {
if (file != null) {
try {
file.drop();
LogManager.instance().log(this, Level.FINE, "Dropped WAL file '%s'", null, file);
} catch (final IOException e) {
LogManager.instance().log(this, Level.SEVERE, "Error on dropping WAL file '%s'", e, file);
}
}
}
} else {
// Close WAL files without deleting: preserve for manual inspection after gap detection
for (final WALFile file : activeWALFilePool) {
if (file != null) {
try {
file.close();
} catch (final IOException e) {
LogManager.instance().log(this, Level.WARNING, "Error on closing WAL file '%s'", e, file);
}
}
}
LogManager.instance().log(this, Level.SEVERE,
"WAL files for database '%s' have been preserved in '%s' for manual inspection.",
null, database, database.getDatabasePath());
}|
Review posted via file |
|
Code Review Overview This PR fixes a real data-corruption bug in WAL crash recovery (fixes #4331). The root cause is well-diagnosed: checkIntegrity() passed ignoreErrors=true to applyChanges(), which caused a version gap to silently skip a single page while continuing to apply sibling pages in the same transaction - a torn transaction. All WAL files were then unconditionally dropped, making the corruption unrecoverable. The fix is conceptually correct. Positive Aspects
Issues
When a gap is detected, two separate SEVERE messages are emitted: one inside the catch block (has txId) and one after the close loop (has path). Consolidate into one message to avoid operator confusion - include both the txId and the path in the catch block message, and remove the second SEVERE log after the close loop.
In checkIntegrityPreservesWALOnVersionGap, both db and database reference the same object. After db.kill(), calling database.close() on the already-killed instance may silently succeed or throw depending on internal state. Other TestHelper-based tests in the codebase use kill() OR close(), not both in sequence. Verify this does not cause resource leaks or spurious failures, and that the database = factory.open() reassignment at the end does not interfere with TestHelper.endTest() cleanup.
applyChangesThrowsOnVersionGap and versionGapPageNotApplied build the same gapped WAL page and both assert WALVersionGapException is thrown. The second only adds a page-version-unchanged assertion. The two tests could be merged into one. Keeping both as-is adds noise without meaningful additional coverage.
The injected gap WAL entry uses 512 for currentPageSize, which may not match the actual page size on disk. If applyChanges uses this field before reaching the version-gap check, the test could fail for a different reason or pass for the wrong reason. Consider using the actual page size already obtained earlier in the test (file.getPageSize()).
docs/4331-wal-version-gap-recovery.md contains implementation notes already captured in the PR description. Per project conventions, analysis and fix-explanation docs are not committed to the repository. Please remove this file.
The multi-line block comments explaining the lastTxId != -1 guard and the transactionIds.set() call are longer than the code they annotate. One short line expressing the why (avoid ID reuse when no transactions were replayed) is sufficient. Test Coverage Gap The tests confirm WAL files are preserved after a gap but do not assert that records written before the gap are still readable after reopening the database. A post-reopen count or query assertion would confirm partial recovery is actually correct, not just that WAL files survive on disk. Summary The fix is sound and addresses a genuine corruption risk. Main items to address before merging:
|
Code ReviewThe fix is well-targeted and the reasoning in the PR description is clear. The two transaction-ID bugs caught by the earlier Gemini review were correctly addressed. A few remaining issues: Critical - NPE in walPositions initialization loop (incomplete null-check fix)The Gemini review flagged missing null-checks on // TransactionManager.java lines 247-251
final WALFile.WALTransaction[] walPositions = new WALFile.WALTransaction[activeWALFilePool.length];
for (int i = 0; i < activeWALFilePool.length; ++i) {
final WALFile file = activeWALFilePool[i];
walPositions[i] = file.getFirstTransaction(); // NPE if activeWALFilePool[i] is null
}
for (int i = 0; i < activeWALFilePool.length; ++i) {
final WALFile file = activeWALFilePool[i];
walPositions[i] = file != null ? file.getFirstTransaction() : null;
}The main while loop already handles Minor - Two SEVERE messages are emitted for the same gap eventWhen a gap is detected, both the catch block (line 280) and the post-loop section (line 319) emit a } catch (WALVersionGapException e) {
LogManager.instance().log(this, Level.SEVERE,
"Recovery aborted for database '%s': version gap in WAL (txId=%d). WAL files preserved in '%s' for manual inspection. No further transactions will be replayed.",
null, database, lowerTxId, database.getDatabasePath());
walGapDetected = true;
break;
}Then the else-branch after the cleanup close loop can just log at FINE or be removed. Minor -
|
…files checkIntegrity() passed ignoreErrors=true to applyChanges(), which on a version gap logged a WARNING and continued to the next page in the same transaction. Sibling pages were applied while the gapped page was skipped, leaving a torn transaction on disk. Afterwards all WAL files were dropped, making the inconsistency permanent and undiagnosable. The fix passes ignoreErrors=false so WALVersionGapException is thrown on the first gap, immediately aborting the entire transaction replay. The exception is caught in checkIntegrity(): WAL files are closed without deletion (preserved for manual inspection) and a SEVERE log is emitted. A new WAL pool is always created so the database can continue writing after partial recovery.
… reset when empty, null-check WAL files in cleanup
92d981b to
a58ea0d
Compare
Code ReviewSummary: This PR correctly fixes a real data-corruption bug: Core logic - correct
Issues1. WAL preservation window is shorter than the log implies (important)
// TransactionManager.java close()
else {
File[] walFiles = dir.listFiles((dir1, name) -> name.endsWith(".wal"));
if (walFiles != null)
Stream.of(walFiles).forEach(File::delete); // deletes preserved gap WALs tooThe preserved files are not in Consider either:
2. Pre-existing null-safety gap in the early close loop (minor, but the PR is adjacent)Lines 226-234 iterate for (final WALFile file : activeWALFilePool) {
try {
file.close(); // NPE if createWALFilePool() left this slot null
} catch (final IOException e) {
// NullPointerException is not IOException - would escape
}
}The PR added null checks in the two cleanup loops lower in the same method but missed this one. Worth fixing in the same pass. 3.
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4356 +/- ##
============================================
- Coverage 64.72% 64.37% -0.35%
- Complexity 446 457 +11
============================================
Files 1647 1647
Lines 127719 127739 +20
Branches 27375 27380 +5
============================================
- Hits 82660 82227 -433
- Misses 33384 33945 +561
+ Partials 11675 11567 -108 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Closes #4331
checkIntegrity()passedignoreErrors=truetoapplyChanges(). When a WAL page version was more than one ahead of the on-disk version (a version gap), the code logged a WARNING and calledcontinue, skipping that page but continuing to apply sibling pages in the same multi-page transaction. The result was a torn transaction on disk. All WAL files were then dropped unconditionally, making the corruption permanent and undiagnosable.The fix passes
ignoreErrors=falsesoWALVersionGapExceptionis thrown immediately on the first gap, stopping the entire transaction replay before any sibling pages are written. The exception is caught incheckIntegrity(): WAL files are closed without deletion (preserved for manual inspection at the database path), a SEVERE log is emitted, and a new WAL pool is created so the database remains writable after partial recovery.Test plan
WALVersionGapRecoveryTest.applyChangesThrowsOnVersionGap: confirmsWALVersionGapExceptionis raised when the WAL page version skips aheadWALVersionGapRecoveryTest.versionGapPageNotApplied: confirms the gapped page is not written to disk (no torn transaction)WALVersionGapRecoveryTest.checkIntegrityPreservesWALOnVersionGap: integration test - kills the database, injects a crafted WAL file with a version gap, reopens (triggeringcheckIntegrity()), and verifies WAL files are still present on disk rather than droppedTransactionManagerCloseWALFsyncTest,ApplyChangesPartialReplayTest,WALFileGetTransactionTest, 65 tests total)