fix(#4320): WALFile.getTransaction distinguishes I/O error from truncation#4338
Conversation
…Transaction - Loop channel.read() until the delta buffer is full, treating EOF as truncation. - Split the catch block so IOException re-throws as WALException, preserving the null sentinel only for genuine end-of-WAL / invalid magic-number cases. Before the fix, any I/O error during recovery was indistinguishable from clean EOF and silently dropped every committed transaction after the failing read. Short reads (return value never checked) also produced garbage page deltas.
There was a problem hiding this comment.
Code Review
This pull request addresses two critical bugs in WALFile.getTransaction(): it ensures short reads are fully resolved by looping until the buffer is populated, and it correctly propagates IOException as a WALException instead of swallowing it as a clean EOF. Additionally, regression tests and documentation are provided. The review feedback suggests validating deltaSize and page boundaries prior to buffer allocation to prevent potential memory exhaustion on corrupted files, and optimizing the truncation test by reusing a single temporary file and adjusting its length dynamically rather than repeatedly creating and deleting files.
| @@ -203,7 +203,14 @@ public WALTransaction getTransaction(long pos) { | |||
| final ByteBuffer buffer = ByteBuffer.allocate(deltaSize); | |||
There was a problem hiding this comment.
To prevent potential OutOfMemoryError or IllegalArgumentException when reading corrupted or malicious WAL files, we should validate the page delta boundaries and page size before allocating the buffer.
| final ByteBuffer buffer = ByteBuffer.allocate(deltaSize); | |
| if (deltaSize <= 0 || tx.pages[i].changesFrom < 0 || tx.pages[i].changesTo >= tx.pages[i].currentPageSize || tx.pages[i].currentPageSize > 1024 * 1024) { | |
| return null; | |
| } | |
| final ByteBuffer buffer = ByteBuffer.allocate(deltaSize); |
| void truncatedWalReturnsNull() throws Exception { | ||
| // Truncate one byte before the magic-number footer completes | ||
| final int fullSize = TX_HEADER_SIZE + SEG_SIZE + TX_FOOTER_SIZE; | ||
| for (int truncPoint = 0; truncPoint < fullSize; truncPoint++) { | ||
| final File truncFile = File.createTempFile("wal-trunc-", ".wal"); | ||
| truncFile.deleteOnExit(); | ||
| WALFile wf = null; | ||
| try { | ||
| writeTruncatedTransaction(truncFile, truncPoint); | ||
| wf = new WALFile(truncFile.getAbsolutePath()); | ||
| assertThat(wf.getTransaction(0)) | ||
| .as("truncated at byte %d should return null", truncPoint) | ||
| .isNull(); | ||
| } finally { | ||
| if (wf != null) | ||
| wf.close(); | ||
| truncFile.delete(); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Creating and deleting 70 temporary files in a loop is highly inefficient and can significantly slow down test execution. We can optimize this by creating a single temporary file, writing the valid transaction once, and then truncating it backwards from fullSize - 1 down to 0 using RandomAccessFile.setLength().
@Test
void truncatedWalReturnsNull() throws Exception {
// Truncate one byte at a time from the end of the transaction
final int fullSize = TX_HEADER_SIZE + SEG_SIZE + TX_FOOTER_SIZE;
final File truncFile = File.createTempFile("wal-trunc-", ".wal");
truncFile.deleteOnExit();
try {
writeValidTransaction(truncFile);
for (int truncPoint = fullSize - 1; truncPoint >= 0; truncPoint--) {
try (RandomAccessFile raf = new RandomAccessFile(truncFile, "rw")) {
raf.setLength(truncPoint);
}
try (WALFile wf = new WALFile(truncFile.getAbsolutePath())) {
assertThat(wf.getTransaction(0))
.as("truncated at byte %d should return null", truncPoint)
.isNull();
}
}
} finally {
truncFile.delete();
}
}
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Coverage variation | ✅ -37.82% coverage variation |
| Diff coverage | ✅ 63.64% diff coverage |
Coverage variation details
Coverable lines Covered lines Coverage Common ancestor commit (c4cf514) 127540 93469 73.29% Head commit (ffae26e) 159228 (+31688) 56466 (-37003) 35.46% (-37.82%) Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch:
<coverage of head commit> - <coverage of common ancestor commit>
Diff coverage details
Coverable lines Covered lines Diff coverage Pull request (#4338) 11 7 63.64% Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified:
<covered lines added or modified>/<coverable lines added or modified> * 100%
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Code ReviewThe fix is correct and well-motivated. This is a genuine data-safety bug - silently swallowing IOException during WAL recovery could mask disk failures and lead to silent data loss. Two targeted comments below. Production code (WALFile.java) - looks goodThe read loop and the catch-block split are both correct:
Residual short-read risk in
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4338 +/- ##
=============================================
- Coverage 64.21% 33.65% -30.56%
- Complexity 448 892 +444
=============================================
Files 1645 1645
Lines 127540 127550 +10
Branches 27350 27353 +3
=============================================
- Hits 81899 42932 -38967
- Misses 34102 76635 +42533
+ Partials 11539 7983 -3556 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
….allocate Addresses gemini-code-assist HIGH-priority review comment on PR #4338 cycle 1: reject obviously corrupted page headers (deltaSize <= 0 or negative changesFrom) before allocating, so a corrupted WAL cannot produce IllegalArgumentException or have garbage delta bytes applied to disk. Dropped the suggested currentPageSize > 1MB / changesTo >= currentPageSize check: currentPageSize in the WAL is the page CONTENT size, not the physical size, and ArcadeDB's configurable page sizes (256KB default for external-property buckets, configurable higher) make the 1MB cap unsafe. The existing segmentSize file-size check at line 170 already bounds total allocation by available file bytes. Test-perf optimization (single temp file + setLength) deferred - the suggested form does not compile (WALFile is not AutoCloseable). Rationale recorded in docs/review-deferred-acd9fdf2.md. Verified: ACIDTransactionTest 11/11, WALFileGetTransactionTest 3/3 still pass.
Code Review - PR #4338: WALFile.getTransaction distinguishes I/O error from truncationOverall: The fix is correct and well-motivated. Two genuine bugs are addressed with minimal, targeted changes. Test coverage is solid. A few issues worth discussing before merging. Bugs Fixed (Correct)Both fixes are accurate:
Issues1.
|
| Item | Severity |
|---|---|
docs/review-deferred-acd9fdf2.md committed |
Must fix - should not be in repo |
readLong()/readInt() same short-read bug unfixed |
Should fix |
Upper-bound missing on deltaSize |
Nice to have |
| Comment verbosity | Minor |
| Test fragility (reflection, duplicated constants) | Minor |
The core fix is correct. Addressing the readLong/readInt gap and removing the review artifact would make this ready to merge.
Code Review - PR #4338: WALFile.getTransaction I/O error vs truncation fixOverviewThis PR fixes two real bugs in
The fix is well-targeted and minimal. The test coverage is solid. Issues1. Multi-line comment violates project style (WALFile.java:203-208) The 4-line block comment above the validation guard conflicts with the project style guide: no multi-line comment blocks - one short line max. The guard is self-explanatory; at most a single-line note is warranted: if (deltaSize <= 0 || tx.pages[i].changesFrom < 0) // reject corrupted header
return null;2. docs/ files appear to be AI review-workflow artifacts
3. Test layout constants duplicate WALFile private statics
Options:
NitsRead loop: potential spin on n == 0 For a blocking Reflection in ioErrorOnChannelThrowsWALException
What is correct
SummaryThe core fix is correct and the test coverage is good. Three actionable items before merging:
|
Closes #4320
Summary
WALFile.getTransaction()had two bugs that made WAL recovery silently drop committed transactions on transient I/O errors:channel.read(buffer, pos)return value was never inspected. A short read left the deltaByteBufferpartially populated, producing garbage page deltas that were then applied to disk.IOException-catch (Exception e) { return null; }used the samenullsentinel that signals clean EOF / truncated WAL. AnyIOExceptionmid-replay was indistinguishable from end-of-WAL, so recovery stopped without surfacing the error.Fix
Two targeted changes in
getTransaction():channel.read()until the delta buffer is full, treatingn == -1as truncation (returnnull).IOExceptionre-throws asWALException; the broadcatch (Exception)is retained so non-I/O parse errors still returnnullas before.Clean EOF, truncated WAL, and invalid magic-number cases continue to return
null- no behaviour change on the normal path.Test plan
WALFileGetTransactionTest(new) covers:validTransactionIsReadCorrectly- all fields read back from a valid WAL transactiontruncatedWalReturnsNull- returnsnullat every truncation point inside a transactionioErrorOnChannelThrowsWALException- closing the channel before reading now throwsWALException, notnull(pre-fix: failed)