Skip to content

fix(#4332): fsync data files before deleting WAL on clean close#4345

Merged
robfrank merged 4 commits into
mainfrom
fix/4332-close-fsync-before-wal-delete
May 26, 2026
Merged

fix(#4332): fsync data files before deleting WAL on clean close#4345
robfrank merged 4 commits into
mainfrom
fix/4332-close-fsync-before-wal-delete

Conversation

@robfrank

Copy link
Copy Markdown
Collaborator

Closes #4332

Summary

On a clean database close, LocalDatabase.internalClose() drained the page-manager queue (waitAllPagesOfDatabaseAreFlushed) and then closed all FileChannel handles, after which TransactionManager.close() deleted every .wal file. No channel.force() was ever issued, so committed data was only in the OS buffer cache when the WAL was removed. A power cut or kernel panic after WAL deletion but before the OS flushed its dirty-page cache permanently lost committed transactions with no recovery path.

The fix adds three small, focused changes:

  • PaginatedComponentFile.force(boolean metaData) - thin wrapper around FileChannel.force().
  • FileManager.syncFiles() - iterates all registered PaginatedComponentFile instances and calls force(true) on each; individual failures are logged as warnings but do not abort the rest.
  • LocalDatabase.internalClose() - calls fileManager.syncFiles() immediately after waitAllPagesOfDatabaseAreFlushed() and before the channels are closed, but only on non-drop closes (durability is not required when the database is being permanently removed).

Test plan

  • TransactionManagerCloseWALFsyncTest.noWalFilesAfterCleanClose - commits 50 records, closes the database, asserts no .wal files remain.
  • TransactionManagerCloseWALFsyncTest.dataReadableAfterReopenWithoutWALRecovery - commits 50 records, closes, manually deletes any leftover WAL files, reopens, and asserts all 50 records are readable - proving data pages were persisted independently of the WAL.
  • Existing transaction and WAL tests: TransactionBucketTest, TransactionTypeTest, LocalDatabaseLastTransactionIdTest, TransactionContextRemoveFileTest, PageManagerFlushQueueRaceTest, WALFileGetTransactionTest - all pass.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses issue #4332 by ensuring that all committed data pages are synced to physical storage (fsynced) before WAL files are deleted during a clean database close. This is achieved by adding a force method to PaginatedComponentFile, a syncFiles method to FileManager, and calling syncFiles in LocalDatabase.closeInternal when the database is not being dropped. A regression test suite has also been added. The reviewer suggests wrapping the channel.force call in a try-catch block to handle ClosedChannelException and retry after reopening the channel, which aligns with how read/write operations are handled in the codebase.

Comment on lines +54 to +57
public void force(final boolean metaData) throws IOException {
if (channel != null)
channel.force(metaData);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To ensure robustness against ClosedChannelException (which can occur if the thread was interrupted during prior I/O operations), consider wrapping the channel.force(metaData) call in a try-catch block to reopen the channel and retry, similar to how it is handled in the read and write methods of this class.

  public void force(final boolean metaData) throws IOException {
    if (channel != null) {
      try {
        channel.force(metaData);
      } catch (final ClosedChannelException e) {
        LogManager.instance().log(this, Level.SEVERE, "File '%s' was closed on force. Reopen it and retry...", null, fileName);
        open(filePath, mode);
        channel.force(metaData);
      }
    }
  }

@codacy-production

codacy-production Bot commented May 26, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 22 complexity

Metric Results
Complexity 22

View in Codacy

🟢 Coverage 63.16% diff coverage

Metric Results
Coverage variation Report missing for a68f6631
Diff coverage 63.16% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (a68f663) Report Missing Report Missing Report Missing
Head commit (b5b2285) 159356 105161 65.99%

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#4345) 19 12 63.16%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

1 Codacy didn't receive coverage data for the commit, or there was an error processing the received data. Check your integration for errors and validate that your coverage setup is correct.

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@claude

claude Bot commented May 26, 2026

Copy link
Copy Markdown

Code Review - PR #4345: fsync data files before deleting WAL on clean close

This is a well-targeted fix for a real durability gap. The root cause analysis is accurate and the placement of syncFiles() is correct. A few observations below.


Core fix - correctness

The fix is placed correctly: inside executeInWriteLock, after waitAllPagesOfDatabaseAreFlushed, before fileManager.close() and transactionManager.close(drop). The !drop guard is also correct - there is no durability requirement when the database is being permanently removed.


Concern: silently swallowing fsync failures still deletes WAL

FileManager.syncFiles() catches IOException per-file and logs a WARNING, then the caller proceeds to transactionManager.close(drop) which deletes WAL files. If a force() call fails (disk full, I/O error), WAL files are still deleted, leaving committed data unrecoverable.

// FileManager.java - syncFiles() logs warning and continues on failure
} catch (final IOException e) {
    LogManager.instance().log(this, Level.WARNING, "Error on syncing file '%s' to disk", e, f.getFileName());
}

The fix correctly closes the gap for the normal case, but a failed fsync followed by WAL deletion is just as unrecoverable as the original bug. Consider whether the log level should be SEVERE or whether a failure should propagate to prevent WAL deletion.


Minor: null-safety inconsistency in test

noWalFilesAfterCleanClose does not guard against a null return from listFiles():

// line 54 - no null check
final File[] walFiles = new File(dbPath).listFiles((dir, name) -> name.endsWith(".wal"));
assertThat(walFiles).isEmpty(); // NPE / assertion error if dbPath is not a directory

The second test method handles this correctly with if (walFiles != null). The path comes from the running database so it should always be a valid directory, but the inconsistency is worth fixing for symmetry and to avoid confusing failures if the path ever differs:

assertThat(walFiles).as("No WAL files must remain after a clean close").isNotNull().isEmpty();

Minor: import ordering in test does not match project convention

Looking at WALFileGetTransactionTest.java in the same package, the project convention is JUnit/third-party imports first, then java.*, then static. The new test has com.arcadedb.* first, then java.*, then org.junit.*, then static, which differs from the convention used in the surrounding test files.


instanceof PaginatedComponentFile pattern note

The pattern check is a sound defensive practice, though in practice all files registered via FileManager.registerFile are already PaginatedComponentFile instances. No action needed, just an observation.


Test coverage

The two regression tests cover the intended scenarios well. dataReadableAfterReopenWithoutWALRecovery is particularly useful as a smoke test. The limitation is inherent - proving actual fsync behavior would require simulating a power cut, which is impractical in unit tests.


Summary

The fix is correct and addresses the described bug. The main actionable items are:

  1. Consider propagating or logging at SEVERE when force() fails in syncFiles(), since silently continuing to delete WAL files after a failed fsync leaves committed data unrecoverable.
  2. Add null check in noWalFilesAfterCleanClose for consistency with the second test method.

robfrank added 4 commits May 26, 2026 14:23
On a clean database close, LocalDatabase.internalClose() called
PageManager.waitAllPagesOfDatabaseAreFlushed() (ensuring all dirty pages
were written to the OS buffer cache) and then fileManager.close() (closing
the FileChannel handles), followed by transactionManager.close() deleting
the WAL files. No channel.force() was ever issued, so the data pages were
only in the OS buffer cache when the WAL was removed. A power cut or OS
crash after WAL deletion but before the OS flushed its cache permanently
lost committed transactions with no recovery path.

Fix: add PaginatedComponentFile.force(boolean), add FileManager.syncFiles()
which calls force(true) on every open PaginatedComponentFile, and invoke
fileManager.syncFiles() inside LocalDatabase.internalClose() after all
pending page writes are drained and before the file channels are closed.
This guarantees all committed data reaches physical storage before the WAL
is deleted. The sync is skipped on drop() since durability is not required
when the database is being permanently removed.

Closes #4332
@robfrank robfrank force-pushed the fix/4332-close-fsync-before-wal-delete branch from c82c316 to b5b2285 Compare May 26, 2026 12:23
@codecov

codecov Bot commented May 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 52.63158% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.29%. Comparing base (128872a) to head (b5b2285).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...va/com/arcadedb/engine/PaginatedComponentFile.java 33.33% 5 Missing and 1 partial ⚠️
...src/main/java/com/arcadedb/engine/FileManager.java 62.50% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4345      +/-   ##
============================================
- Coverage     65.37%   64.29%   -1.09%     
- Complexity        0      444     +444     
============================================
  Files          1610     1647      +37     
  Lines        123789   127678    +3889     
  Branches      26776    27365     +589     
============================================
+ Hits          80925    82085    +1160     
- Misses        31429    34017    +2588     
- Partials      11435    11576     +141     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@robfrank robfrank merged commit 0c83f07 into main May 26, 2026
27 of 40 checks passed
@lvca lvca added this to the 26.6.1 milestone May 26, 2026
@robfrank robfrank self-assigned this May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TransactionManager.close(false) deletes WAL files without fsyncing data pages

2 participants