fix(#4332): fsync data files before deleting WAL on clean close#4345
Conversation
There was a problem hiding this comment.
Code Review
This pull request addresses issue #4332 by ensuring that all committed data pages are synced to physical storage (fsynced) before WAL files are deleted during a clean database close. This is achieved by adding a force method to PaginatedComponentFile, a syncFiles method to FileManager, and calling syncFiles in LocalDatabase.closeInternal when the database is not being dropped. A regression test suite has also been added. The reviewer suggests wrapping the channel.force call in a try-catch block to handle ClosedChannelException and retry after reopening the channel, which aligns with how read/write operations are handled in the codebase.
| public void force(final boolean metaData) throws IOException { | ||
| if (channel != null) | ||
| channel.force(metaData); | ||
| } |
There was a problem hiding this comment.
To ensure robustness against ClosedChannelException (which can occur if the thread was interrupted during prior I/O operations), consider wrapping the channel.force(metaData) call in a try-catch block to reopen the channel and retry, similar to how it is handled in the read and write methods of this class.
public void force(final boolean metaData) throws IOException {
if (channel != null) {
try {
channel.force(metaData);
} catch (final ClosedChannelException e) {
LogManager.instance().log(this, Level.SEVERE, "File '%s' was closed on force. Reopen it and retry...", null, fileName);
open(filePath, mode);
channel.force(metaData);
}
}
}
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 22 |
🟢 Coverage 63.16% diff coverage
Metric Results Coverage variation Report missing for a68f6631 Diff coverage ✅ 63.16% diff coverage Coverage variation details
Coverable lines Covered lines Coverage Common ancestor commit (a68f663) Report Missing Report Missing Report Missing Head commit (b5b2285) 159356 105161 65.99% Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch:
<coverage of head commit> - <coverage of common ancestor commit>Diff coverage details
Coverable lines Covered lines Diff coverage Pull request (#4345) 19 12 63.16% Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified:
<covered lines added or modified>/<coverable lines added or modified> * 100%1 Codacy didn't receive coverage data for the commit, or there was an error processing the received data. Check your integration for errors and validate that your coverage setup is correct.
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Code Review - PR #4345: fsync data files before deleting WAL on clean closeThis is a well-targeted fix for a real durability gap. The root cause analysis is accurate and the placement of Core fix - correctnessThe fix is placed correctly: inside Concern: silently swallowing fsync failures still deletes WAL
// FileManager.java - syncFiles() logs warning and continues on failure
} catch (final IOException e) {
LogManager.instance().log(this, Level.WARNING, "Error on syncing file '%s' to disk", e, f.getFileName());
}The fix correctly closes the gap for the normal case, but a failed fsync followed by WAL deletion is just as unrecoverable as the original bug. Consider whether the log level should be Minor: null-safety inconsistency in test
// line 54 - no null check
final File[] walFiles = new File(dbPath).listFiles((dir, name) -> name.endsWith(".wal"));
assertThat(walFiles).isEmpty(); // NPE / assertion error if dbPath is not a directoryThe second test method handles this correctly with assertThat(walFiles).as("No WAL files must remain after a clean close").isNotNull().isEmpty();Minor: import ordering in test does not match project conventionLooking at
|
On a clean database close, LocalDatabase.internalClose() called PageManager.waitAllPagesOfDatabaseAreFlushed() (ensuring all dirty pages were written to the OS buffer cache) and then fileManager.close() (closing the FileChannel handles), followed by transactionManager.close() deleting the WAL files. No channel.force() was ever issued, so the data pages were only in the OS buffer cache when the WAL was removed. A power cut or OS crash after WAL deletion but before the OS flushed its cache permanently lost committed transactions with no recovery path. Fix: add PaginatedComponentFile.force(boolean), add FileManager.syncFiles() which calls force(true) on every open PaginatedComponentFile, and invoke fileManager.syncFiles() inside LocalDatabase.internalClose() after all pending page writes are drained and before the file channels are closed. This guarantees all committed data reaches physical storage before the WAL is deleted. The sync is skipped on drop() since durability is not required when the database is being permanently removed. Closes #4332
c82c316 to
b5b2285
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4345 +/- ##
============================================
- Coverage 65.37% 64.29% -1.09%
- Complexity 0 444 +444
============================================
Files 1610 1647 +37
Lines 123789 127678 +3889
Branches 26776 27365 +589
============================================
+ Hits 80925 82085 +1160
- Misses 31429 34017 +2588
- Partials 11435 11576 +141 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Closes #4332
Summary
On a clean database close,
LocalDatabase.internalClose()drained the page-manager queue (waitAllPagesOfDatabaseAreFlushed) and then closed allFileChannelhandles, after whichTransactionManager.close()deleted every.walfile. Nochannel.force()was ever issued, so committed data was only in the OS buffer cache when the WAL was removed. A power cut or kernel panic after WAL deletion but before the OS flushed its dirty-page cache permanently lost committed transactions with no recovery path.The fix adds three small, focused changes:
PaginatedComponentFile.force(boolean metaData)- thin wrapper aroundFileChannel.force().FileManager.syncFiles()- iterates all registeredPaginatedComponentFileinstances and callsforce(true)on each; individual failures are logged as warnings but do not abort the rest.LocalDatabase.internalClose()- callsfileManager.syncFiles()immediately afterwaitAllPagesOfDatabaseAreFlushed()and before the channels are closed, but only on non-drop closes (durability is not required when the database is being permanently removed).Test plan
TransactionManagerCloseWALFsyncTest.noWalFilesAfterCleanClose- commits 50 records, closes the database, asserts no.walfiles remain.TransactionManagerCloseWALFsyncTest.dataReadableAfterReopenWithoutWALRecovery- commits 50 records, closes, manually deletes any leftover WAL files, reopens, and asserts all 50 records are readable - proving data pages were persisted independently of the WAL.TransactionBucketTest,TransactionTypeTest,LocalDatabaseLastTransactionIdTest,TransactionContextRemoveFileTest,PageManagerFlushQueueRaceTest,WALFileGetTransactionTest- all pass.