Skip to content

fix(#4320): WALFile.getTransaction distinguishes I/O error from truncation#4338

Merged
robfrank merged 3 commits into
mainfrom
fix/4320-walfile-io-error-vs-truncation
May 26, 2026
Merged

fix(#4320): WALFile.getTransaction distinguishes I/O error from truncation#4338
robfrank merged 3 commits into
mainfrom
fix/4320-walfile-io-error-vs-truncation

Conversation

@robfrank

Copy link
Copy Markdown
Collaborator

Closes #4320

Summary

WALFile.getTransaction() had two bugs that made WAL recovery silently drop committed transactions on transient I/O errors:

  1. Unchecked short reads - channel.read(buffer, pos) return value was never inspected. A short read left the delta ByteBuffer partially populated, producing garbage page deltas that were then applied to disk.
  2. Swallowed IOException - catch (Exception e) { return null; } used the same null sentinel that signals clean EOF / truncated WAL. Any IOException mid-replay was indistinguishable from end-of-WAL, so recovery stopped without surfacing the error.

Fix

Two targeted changes in getTransaction():

  1. Loop channel.read() until the delta buffer is full, treating n == -1 as truncation (return null).
  2. Split the catch block: IOException re-throws as WALException; the broad catch (Exception) is retained so non-I/O parse errors still return null as before.

Clean EOF, truncated WAL, and invalid magic-number cases continue to return null - no behaviour change on the normal path.

Test plan

WALFileGetTransactionTest (new) covers:

  • validTransactionIsReadCorrectly - all fields read back from a valid WAL transaction
  • truncatedWalReturnsNull - returns null at every truncation point inside a transaction
  • ioErrorOnChannelThrowsWALException - closing the channel before reading now throws WALException, not null (pre-fix: failed)
  • Full engine test suite: 7403 tests, 0 failures, 0 errors

…Transaction

- Loop channel.read() until the delta buffer is full, treating EOF as truncation.
- Split the catch block so IOException re-throws as WALException, preserving the
  null sentinel only for genuine end-of-WAL / invalid magic-number cases.

Before the fix, any I/O error during recovery was indistinguishable from clean
EOF and silently dropped every committed transaction after the failing read.
Short reads (return value never checked) also produced garbage page deltas.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses two critical bugs in WALFile.getTransaction(): it ensures short reads are fully resolved by looping until the buffer is populated, and it correctly propagates IOException as a WALException instead of swallowing it as a clean EOF. Additionally, regression tests and documentation are provided. The review feedback suggests validating deltaSize and page boundaries prior to buffer allocation to prevent potential memory exhaustion on corrupted files, and optimizing the truncation test by reusing a single temporary file and adjusting its length dynamically rather than repeatedly creating and deleting files.

@@ -203,7 +203,14 @@ public WALTransaction getTransaction(long pos) {
final ByteBuffer buffer = ByteBuffer.allocate(deltaSize);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

To prevent potential OutOfMemoryError or IllegalArgumentException when reading corrupted or malicious WAL files, we should validate the page delta boundaries and page size before allocating the buffer.

Suggested change
final ByteBuffer buffer = ByteBuffer.allocate(deltaSize);
if (deltaSize <= 0 || tx.pages[i].changesFrom < 0 || tx.pages[i].changesTo >= tx.pages[i].currentPageSize || tx.pages[i].currentPageSize > 1024 * 1024) {
return null;
}
final ByteBuffer buffer = ByteBuffer.allocate(deltaSize);

Comment on lines +107 to +126
void truncatedWalReturnsNull() throws Exception {
// Truncate one byte before the magic-number footer completes
final int fullSize = TX_HEADER_SIZE + SEG_SIZE + TX_FOOTER_SIZE;
for (int truncPoint = 0; truncPoint < fullSize; truncPoint++) {
final File truncFile = File.createTempFile("wal-trunc-", ".wal");
truncFile.deleteOnExit();
WALFile wf = null;
try {
writeTruncatedTransaction(truncFile, truncPoint);
wf = new WALFile(truncFile.getAbsolutePath());
assertThat(wf.getTransaction(0))
.as("truncated at byte %d should return null", truncPoint)
.isNull();
} finally {
if (wf != null)
wf.close();
truncFile.delete();
}
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Creating and deleting 70 temporary files in a loop is highly inefficient and can significantly slow down test execution. We can optimize this by creating a single temporary file, writing the valid transaction once, and then truncating it backwards from fullSize - 1 down to 0 using RandomAccessFile.setLength().

  @Test
  void truncatedWalReturnsNull() throws Exception {
    // Truncate one byte at a time from the end of the transaction
    final int fullSize = TX_HEADER_SIZE + SEG_SIZE + TX_FOOTER_SIZE;
    final File truncFile = File.createTempFile("wal-trunc-", ".wal");
    truncFile.deleteOnExit();
    try {
      writeValidTransaction(truncFile);
      for (int truncPoint = fullSize - 1; truncPoint >= 0; truncPoint--) {
        try (RandomAccessFile raf = new RandomAccessFile(truncFile, "rw")) {
          raf.setLength(truncPoint);
        }
        try (WALFile wf = new WALFile(truncFile.getAbsolutePath())) {
          assertThat(wf.getTransaction(0))
              .as("truncated at byte %d should return null", truncPoint)
              .isNull();
        }
      }
    } finally {
      truncFile.delete();
    }
  }

@codacy-production

codacy-production Bot commented May 25, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Coverage 63.64% diff coverage · -37.82% coverage variation

Metric Results
Coverage variation -37.82% coverage variation
Diff coverage 63.64% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (c4cf514) 127540 93469 73.29%
Head commit (ffae26e) 159228 (+31688) 56466 (-37003) 35.46% (-37.82%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#4338) 11 7 63.64%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@claude

claude Bot commented May 25, 2026

Copy link
Copy Markdown

Code Review

The fix is correct and well-motivated. This is a genuine data-safety bug - silently swallowing IOException during WAL recovery could mask disk failures and lead to silent data loss. Two targeted comments below.

Production code (WALFile.java) - looks good

The read loop and the catch-block split are both correct:

  • The readPos tracking variable is cleanly separated from pos (which still advances by deltaSize after the loop), avoiding an off-by-one if the loop takes multiple iterations.
  • A blocking FileChannel opened from RandomAccessFile cannot return 0, so the loop terminates on either n == -1 (truncation) or a full buffer - no infinite loop risk.
  • WALException is a RuntimeException, so the new throw propagates correctly through TransactionManager.checkIntegrity() (which has only a finally block) and up to LocalDatabase.checkForRecovery() where it surfaces as a database-open failure. That is the right behavior.

Residual short-read risk in readLong / readInt

The private helpers at lines 358-368 use a single channel.read(bufferLong, pos) / channel.read(bufferInt, pos) without checking the return value or looping:

private long readLong(final long pos) throws IOException {
    bufferLong.rewind();
    channel.read(bufferLong, pos);   // return value ignored - same class of bug
    return bufferLong.getLong(0);
}

For 4-8 byte reads on a local disk this is extremely unlikely to cause problems in practice (kernel page cache fills small reads atomically), but it is the same pattern that was just fixed in the delta loop. Worth a follow-up issue or a note so it does not get forgotten.

Test file - minor style issue

The class-level Javadoc is a multi-paragraph comment block:

/**
 * Unit tests for {@link WALFile#getTransaction(long)}.
 *
 * Covers:
 * 1. Valid transactions are read back with all fields intact.
 * ...
 */

The project style guide says no multi-paragraph docstrings. The information is already captured in the test method names and the PR description, so this block can be dropped entirely.

The reflection-based channel close in ioErrorOnChannelThrowsWALException is the right call - there is no other practical way to inject a channel failure without restructuring WALFile. Worth a brief note if the field is ever renamed.

Summary

The core fix is correct and the test coverage is solid (truncation at every byte is a particularly thorough touch). The two items above are minor: the residual readLong/readInt short-read issue is a pre-existing concern worth tracking separately, and the multi-line class Javadoc can be trimmed to match project style. Both are non-blocking.

@codecov

codecov Bot commented May 25, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 45.45455% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 33.65%. Comparing base (c4cf514) to head (ffae26e).

Files with missing lines Patch % Lines
...ine/src/main/java/com/arcadedb/engine/WALFile.java 45.45% 4 Missing and 2 partials ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##               main    #4338       +/-   ##
=============================================
- Coverage     64.21%   33.65%   -30.56%     
- Complexity      448      892      +444     
=============================================
  Files          1645     1645               
  Lines        127540   127550       +10     
  Branches      27350    27353        +3     
=============================================
- Hits          81899    42932    -38967     
- Misses        34102    76635    +42533     
+ Partials      11539     7983     -3556     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

….allocate

Addresses gemini-code-assist HIGH-priority review comment on PR #4338 cycle 1:
reject obviously corrupted page headers (deltaSize <= 0 or negative changesFrom)
before allocating, so a corrupted WAL cannot produce IllegalArgumentException or
have garbage delta bytes applied to disk.

Dropped the suggested currentPageSize > 1MB / changesTo >= currentPageSize check:
currentPageSize in the WAL is the page CONTENT size, not the physical size, and
ArcadeDB's configurable page sizes (256KB default for external-property buckets,
configurable higher) make the 1MB cap unsafe. The existing segmentSize file-size
check at line 170 already bounds total allocation by available file bytes.

Test-perf optimization (single temp file + setLength) deferred - the suggested
form does not compile (WALFile is not AutoCloseable). Rationale recorded in
docs/review-deferred-acd9fdf2.md.

Verified: ACIDTransactionTest 11/11, WALFileGetTransactionTest 3/3 still pass.
@claude

claude Bot commented May 25, 2026

Copy link
Copy Markdown

Code Review - PR #4338: WALFile.getTransaction distinguishes I/O error from truncation

Overall: The fix is correct and well-motivated. Two genuine bugs are addressed with minimal, targeted changes. Test coverage is solid. A few issues worth discussing before merging.


Bugs Fixed (Correct)

Both fixes are accurate:

  1. Short-read loop - channel.read() on a NIO FileChannel can legally return fewer bytes than requested even on a local file. The loop-until-full pattern is the correct fix.
  2. IOException split - Catching IOException separately and re-throwing as WALException is the right approach. The recovery loop in TransactionManager (lines 250/277) relies on null meaning "no more transactions" - swallowing a real I/O error there would silently drop committed data.

Issues

1. docs/review-deferred-acd9fdf2.md should not be in this PR

This file is an internal AI review tool artifact ("cycle 1, HEAD acd9fdf") and has no place in the repository. There is already another such file (docs/review-deferred-4eefd6c3a.md) in the repo - both should be cleaned up. These are not project documentation and will confuse future contributors.

2. readLong() and readInt() have the same short-read bug (not fixed)

// WALFile.java lines 365-375
private long readLong(final long pos) throws IOException {
  bufferLong.rewind();
  channel.read(bufferLong, pos);   // return value not checked - same bug
  return bufferLong.getLong(0);
}

private int readInt(final long pos) throws IOException {
  bufferInt.rewind();
  channel.read(bufferInt, pos);    // return value not checked - same bug
  return bufferInt.getInt(0);
}

These are called extensively in getTransaction() to read the transaction header and all page headers. A short read here would silently produce a wrong txId, wrong page count, wrong changesFrom/changesTo, etc. - the exact garbage-data-applied-to-disk problem the PR aims to fix. For small aligned reads on a local FileChannel short reads are rare in practice, but the correctness argument for fixing the delta buffer applies equally here.

3. Validation guard comment overstates the protection

// Reject obviously corrupted page headers so ByteBuffer.allocate cannot blow up with a
// negative size and a garbage delta cannot be applied to disk. The outer segment-size
// check above already bounds total memory by file size, so a per-page upper cap is not
// needed here.
if (deltaSize <= 0 || tx.pages[i].changesFrom < 0)
  return null;

The outer check at line 170 validates the declared segmentSize from the header against the file length. It does not bound individual deltaSize values within the segment. If changesTo is a garbage large value (e.g. from a partially-overwritten WAL), deltaSize could still be Integer.MAX_VALUE and ByteBuffer.allocate() would throw OutOfMemoryError (an Error, not caught by catch (Exception e)). Consider adding an upper bound: deltaSize > segmentSize is a cheap sanity check that would catch this case.

Also, changesTo < changesFrom is already covered by deltaSize <= 0, but checking changesTo < 0 is missing. It is implicitly safe only because a negative changesTo with a non-negative changesFrom would produce deltaSize <= 0. Fine, but worth noting.

4. Comment verbosity

Per project conventions, comments should only explain the non-obvious WHY. The 3-line comment on the validation guard documents what the code visibly does ("so ByteBuffer.allocate cannot blow up"). One line ("guard against negative delta from corrupted changesTo/From") is enough; the assertion in writeTransactionToBuffer at line 270 shows these are expected to be positive.


Test Coverage

Good overall. A few notes:

  • truncatedWalReturnsNull creates 70 temp files. Functionally fine (runs in ~74ms), but a single file truncated with FileChannel.truncate() in a loop would be cleaner and faster.
  • Reflection usage (extractChannel) is fragile - a field rename in WALFile silently breaks the test. Consider exposing a package-private closeChannelForTesting() helper, or restructuring the test to use a MockFileChannel wrapper. Given the test is in the same package (com.arcadedb.engine), package-private access is available.
  • Header size constants in the test duplicate the private constants from WALFile.java. If the WAL format changes, the tests could silently become wrong. Expose the constants as package-private or document the duplication risk clearly.
  • No test for readLong/readInt short reads - follows naturally from issue Bump gremlin.version from 3.4.10 to 3.5.1 #2 above.

Minor

  • docs/4320-walfile-io-error-vs-truncation.md duplicates the PR description almost verbatim. The docs folder already has several per-issue files (4274-..., 4275-...) so this follows a precedent, but the PR body is a better home for this content. Up to maintainer preference.

Summary

Item Severity
docs/review-deferred-acd9fdf2.md committed Must fix - should not be in repo
readLong()/readInt() same short-read bug unfixed Should fix
Upper-bound missing on deltaSize Nice to have
Comment verbosity Minor
Test fragility (reflection, duplicated constants) Minor

The core fix is correct. Addressing the readLong/readInt gap and removing the review artifact would make this ready to merge.

@claude

claude Bot commented May 25, 2026

Copy link
Copy Markdown

Code Review - PR #4338: WALFile.getTransaction I/O error vs truncation fix

Overview

This PR fixes two real bugs in WALFile.getTransaction():

  1. Unchecked short reads from FileChannel.read() could leave a delta buffer partially populated, silently applying garbage page content on recovery.
  2. A broad catch (Exception e) { return null; } made IOException indistinguishable from clean EOF, causing recovery to silently stop mid-replay.

The fix is well-targeted and minimal. The test coverage is solid.


Issues

1. Multi-line comment violates project style (WALFile.java:203-208)

The 4-line block comment above the validation guard conflicts with the project style guide: no multi-line comment blocks - one short line max.

The guard is self-explanatory; at most a single-line note is warranted:

if (deltaSize <= 0 || tx.pages[i].changesFrom < 0)  // reject corrupted header
  return null;

2. docs/ files appear to be AI review-workflow artifacts

docs/review-deferred-acd9fdf2.md contains tooling-internal state (review cycle IDs, bot deferred-item tracking). docs/4320-walfile-io-error-vs-truncation.md largely duplicates the PR description with additional review-cycle prose. This metadata belongs in the PR itself, not in the repository. Consider dropping these files from the commit, particularly review-deferred-acd9fdf2.md.


3. Test layout constants duplicate WALFile private statics

WALFileGetTransactionTest re-declares TX_HEADER_SIZE = 24, PAGE_HEADER_SIZE = 24, TX_FOOTER_SIZE = 12 as hard-coded values mirroring WALFile's private constants. If the WAL on-disk format ever changes, these silently drift and tests could pass while testing the wrong layout.

Options:

  • Widen WALFile's layout constants to package-private (same package: com.arcadedb.engine) so the test can reference them directly.
  • Or make the test format-agnostic by roundtripping through WALFile.writeTransactionToBuffer rather than building the binary manually.

Nits

Read loop: potential spin on n == 0

For a blocking FileChannel on a regular file this is not a practical concern, but the NIO contract does not formally exclude a 0-byte return in non-blocking mode. A defensive if (n == 0) continue; inside the loop would make the intent explicit at zero cost.

Reflection in ioErrorOnChannelThrowsWALException

getDeclaredField("channel") is fragile if the field is ever renamed. A brief comment noting that no public API exposes the channel would help future maintainers understand why reflection is used here.


What is correct

  • The read loop correctly handles partial reads and treats EOF mid-delta as truncation (returns null).
  • IOException is caught before the broad Exception catch - correct ordering, and the compiler enforces it.
  • deltaSize <= 0 || changesFrom < 0 correctly prevents ByteBuffer.allocate(negative), and the outer segmentSize check already bounds total memory allocation to the file size.
  • AssertJ assertions and JUnit 5 style match project conventions.
  • Temp files are cleaned up in finally blocks.
  • No new dependencies introduced.

Summary

The core fix is correct and the test coverage is good. Three actionable items before merging:

  1. Reduce the 4-line production comment to one line (or remove it).
  2. Remove docs/review-deferred-acd9fdf2.md (and optionally docs/4320-walfile-io-error-vs-truncation.md) - this metadata belongs in the PR, not the repository.
  3. Consider making WALFile's layout constants package-private so the test does not duplicate them with magic numbers.

@robfrank robfrank merged commit 003b354 into main May 26, 2026
24 of 30 checks passed
@robfrank robfrank deleted the fix/4320-walfile-io-error-vs-truncation branch May 26, 2026 09:16
@robfrank robfrank added this to the 26.6.1 milestone May 27, 2026
@robfrank robfrank self-assigned this May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WALFile.getTransaction confuses I/O error with truncation — recovery silently drops committed transactions

1 participant