Fix false positive corruption in compaction output verification by xingbowang · Pull Request #14456 · facebook/rocksdb

xingbowang · 2026-03-12T18:54:38Z

Summary:
Fix a bug where VerifyOutputFiles() produces false positive "Key-value checksum of compaction output doesn't match what was computed when written" errors when verify_output_flags includes kVerifyIteration but paranoid_file_checks is false.

The root cause is a hash enable flag mismatch between writing and verification:

During compaction writing (OpenCompactionOutputFile), the OutputValidator hash computation was gated solely by paranoid_file_checks_. When false, enable_hash=false and the hash stays at 0.
During verification (VerifyOutputFiles), a new OutputValidator is always created with enable_hash=true, computing a non-zero hash.
CompareValidator() then compares 0 vs non-zero, producing a false positive corruption.

This was exposed by the crash test randomization of verify_output_flags added in #14433. Before that change, verify_output_flags was always 0, so kVerifyIteration was only exercised via paranoid_file_checks=true (which correctly enabled the hash during writing).

The fix ensures hash computation is enabled during writing whenever either paranoid_file_checks_ is true OR verify_output_flags includes kVerifyIteration.

Test Plan:

Added DBCompactionTest.VerifyIterationWithoutParanoidFileChecks
Added DBCompactionTest.VerifyAllOutputFlagsWithoutParanoidFileChecks
Round-trip verified: tests FAIL without fix, PASS with fix

github-actions · 2026-03-12T19:10:37Z

✅ clang-tidy: No findings on changed lines

Completed in 275.6s.

jaykorean

Left one nit comment. Thanks for fixing this!

jaykorean · 2026-03-12T19:18:38Z

db/compaction/compaction_job.cc

  for (size_t i = 0; i < output_files.size(); i++) {
    FileMetaData file_copy = output_files[i];

+    bool enable_output_hash =


nit: this bool can be checked outside of for loop and can be const

Good catch. Thank you.

Summary: Fix a bug where `VerifyOutputFiles()` produces false positive "Key-value checksum of compaction output doesn't match what was computed when written" errors when `verify_output_flags` includes `kVerifyIteration` but `paranoid_file_checks` is false. The root cause is a hash enable flag mismatch between writing and verification: - During compaction writing (`OpenCompactionOutputFile`), the `OutputValidator` hash computation was gated solely by `paranoid_file_checks_`. When false, `enable_hash=false` and the hash stays at 0. - During verification (`VerifyOutputFiles`), a new `OutputValidator` is always created with `enable_hash=true`, computing a non-zero hash. - `CompareValidator()` then compares 0 vs non-zero, producing a false positive corruption. This was exposed by the crash test randomization of `verify_output_flags` added in facebook#14433. Before that change, `verify_output_flags` was always 0, so `kVerifyIteration` was only exercised via `paranoid_file_checks=true` (which correctly enabled the hash during writing). The fix ensures hash computation is enabled during writing whenever either `paranoid_file_checks_` is true OR `verify_output_flags` includes `kVerifyIteration`. Test Plan: - Added `DBCompactionTest.VerifyIterationWithoutParanoidFileChecks` - Added `DBCompactionTest.VerifyAllOutputFlagsWithoutParanoidFileChecks` - Round-trip verified: tests FAIL without fix, PASS with fix

meta-codesync · 2026-03-12T21:39:57Z

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D96371769.

meta-codesync · 2026-03-12T23:40:44Z

@xingbowang merged this pull request in e44a99f.

Summary: Fix leaked table cache entries that cause `TEST_VerifyNoObsoleteFilesCached` assertion failure during `DB::Close()` in ASAN crash test builds (T258745630): ``` File 126519 is not live nor quarantined Assertion `cached_file_is_live_or_quar' failed. ``` When a compaction fails *after* `VerifyOutputFiles` succeeds (e.g., at `VerifyCompactionRecordCounts`), the overall `compact_->status` is set to error but each subcompaction's individual `status` remains OK. `SubcompactionState::Cleanup` only checked the individual subcompaction status, so it skipped calling `ReleaseObsolete` on the output files' table cache entries — leaking them. Normally, `Close()`'s backstop (`FindObsoleteFiles(force=true)` + `PurgeObsoleteFiles`) would catch this by finding the orphan file on disk, evicting the cache entry, and deleting the file. However, `FindObsoleteFiles` calls `FaultInjectionTestFS::GetChildren` which can fail under metadata read fault injection (`--open_metadata_read_fault_one_in=8`). The error is silently ignored (`s.PermitUncheckedError()` in db_impl_files.cc:199), so the orphan file is never found and the leaked cache entry is never evicted. The `Cleanup` bug is latent and predates recent changes. PR #14433 added `verify_output_flags` randomization to the crash test, and PR #14456 fixed false-positive corruptions that #14433 caused. Before #14456, `VerifyOutputFiles` would produce false corruption errors that *accidentally prevented the leak* by setting the subcompaction status to non-OK. ### How it triggers **Step 1 — VerifyOutputFiles adds cache entries for compaction output files:** ``` CompactionJob::Run() RunSubcompactions() // all subcompactions succeed // output file 12 written to disk // each sub_compact.status = OK SyncOutputDirectories() // status = OK VerifyOutputFiles() for each output_file: table_cache()->NewIterator(output_file.meta) FindTable() cache->Insert(file_number=12) // <<< ENTRY ADDED TO TABLE CACHE return iterator holding handle delete iter // releases handle, entry stays in LRU // status = OK ``` **Step 2 — A post-verification step fails, overall status set but NOT subcompaction status:** ``` CompactionJob::Run() VerifyCompactionRecordCounts() // returns Status::Corruption(...) FinalizeCompactionRun(status=Corruption) compact_->status = Corruption // <<< OVERALL status = error // each sub_compact.status is still OK! ``` **Step 3 — Install skips InstallCompactionResults (file 12 never enters a Version):** ``` CompactionJob::Install() status = compact_->status // Corruption if (status.ok()) // FALSE InstallCompactionResults() // <<< SKIPPED — file 12 not in any version ``` **Step 4 — CleanupCompaction skips ReleaseObsolete (THE BUG):** ``` CompactionJob::CleanupCompaction() for each sub_compact: sub_compact.Cleanup(table_cache) if (!status.ok()) // checks sub_compact.status = OK // <<< FALSE — individual status is OK! ReleaseObsolete(...) // <<< NEVER CALLED — cache entry leaked! ``` **Step 5 — Metadata read fault prevents backstop from finding the orphan:** `FindObsoleteFiles(force=true)` calls `GetChildren()` to scan the DB directory. `FaultInjectionTestFS::GetChildren` injects a metadata read error (`--open_metadata_read_fault_one_in=8`). The error is silently ignored (`s.PermitUncheckedError()`), so the directory listing is empty and the orphan file is never found. This happens both in the post-compaction cleanup (`BackgroundCallCompaction`) and in `Close()`'s backstop: ``` FindObsoleteFiles(force=true) fs->GetChildren(path, ...) // returns IOError (fault injected) s.PermitUncheckedError(); // error silently ignored // files vector is empty — orphan file 12 not found PurgeObsoleteFiles() // nothing to do — no Evict called ``` **Step 6 — Assertion fires during Close():** ``` CloseHelper() FindObsoleteFiles(force=true) // GetChildren fails again → orphan missed PurgeObsoleteFiles() // nothing to evict TEST_VerifyNoObsoleteFilesCached() for each cache entry: file_number = 12 live_and_quar_files.find(12) == end() // NOT in any version! >>> assert(cached_file_is_live_or_quar) FAILS <<< ``` **Crash test call stack:** ``` frame #9: __assert_fail_base("cached_file_is_live_or_quar", "db_impl_debug.cc", 389) frame #11: DBImpl::TEST_VerifyNoObsoleteFilesCached()::lambda // finds leaked entry frame #18: LRUCacheShard::ApplyToSomeEntries(...) // iterating cache shard frame #19: ShardedCache::ApplyToAllEntries(...) // iterating all shards frame #20: DBImpl::TEST_VerifyNoObsoleteFilesCached() // the verification frame #21: DBImpl::CloseHelper() // during Close frame #22: DBImpl::CloseImpl() frame #23: DBImpl::Close() frame #24: StressTest::Reopen() // crash test reopen frame #25: StressTest::OperateDb() // worker thread ``` ### Fix Pass the overall `compact_->status` to `SubcompactionState::Cleanup` and call `ReleaseObsolete` when *either* the subcompaction status or the overall status is non-OK: ```cpp // Before: if (!status.ok()) { ReleaseObsolete(...); } // After: if (!status.ok() || !overall_status.ok()) { ReleaseObsolete(...); } ``` ## Key changes - **`SubcompactionState::Cleanup`**: Now takes an `overall_status` parameter and calls `ReleaseObsolete` when *either* the subcompaction status or the overall compaction status is non-OK. - **`CompactionJob::CleanupCompaction`**: Passes `compact_->status` (the overall status) to each subcompaction's `Cleanup`. - **Sync point**: Added `CompactionJob::Run():AfterVerifyOutputFiles` for error injection in tests. - **Unit test**: `DBCompactionTest.LeakedTableCacheEntryOnCompactionFailure` uses `FaultInjectionTestFS` to reproduce the crash test scenario — injects error after `VerifyOutputFiles` and deactivates the filesystem so `GetChildren` fails in `FindObsoleteFiles`, preventing the backstop from evicting the leaked cache entry. Pull Request resolved: #14469 Test Plan: - `DBCompactionTest.LeakedTableCacheEntryOnCompactionFailure`: - **Without fix (ASAN)**: assertion fires during `Close()` — `File 12 is not live nor quarantined` - **With fix (ASAN)**: passes — `ReleaseObsolete` properly cleans up the cache entry - **With fix (non-ASAN)**: passes ``` $ COMPILE_WITH_ASAN=1 make -j db_compaction_test $ ./db_compaction_test --gtest_filter="DBCompactionTest.LeakedTableCacheEntryOnCompactionFailure" [ PASSED ] 1 test. ``` Reviewed By: xingbowang Differential Revision: D97190944 Pulled By: joshkang97 fbshipit-source-id: fdfd481cc1e192803cfb7d64052ccb9162c21b94

meta-cla bot added the CLA Signed label Mar 12, 2026

jaykorean approved these changes Mar 12, 2026

View reviewed changes

xingbowang force-pushed the 2026_03_12_file_cksum branch from cd98976 to 3a41cb7 Compare March 12, 2026 21:13

meta-codesync bot closed this in e44a99f Mar 12, 2026

facebook-github-tools bot added the Merged label Mar 12, 2026

joshkang97 mentioned this pull request Mar 20, 2026

Fix table handle leak #14469

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix false positive corruption in compaction output verification#14456

Fix false positive corruption in compaction output verification#14456
xingbowang wants to merge 1 commit intofacebook:mainfrom
xingbowang:2026_03_12_file_cksum

xingbowang commented Mar 12, 2026

Uh oh!

github-actions bot commented Mar 12, 2026 •

edited

Loading

Uh oh!

jaykorean left a comment

Uh oh!

jaykorean Mar 12, 2026

Uh oh!

xingbowang Mar 12, 2026

Uh oh!

meta-codesync bot commented Mar 12, 2026

Uh oh!

meta-codesync bot commented Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

xingbowang commented Mar 12, 2026

Uh oh!

github-actions bot commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ clang-tidy: No findings on changed lines

Uh oh!

jaykorean left a comment

Choose a reason for hiding this comment

Uh oh!

jaykorean Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

xingbowang Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

meta-codesync bot commented Mar 12, 2026

Uh oh!

meta-codesync bot commented Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions bot commented Mar 12, 2026 •

edited

Loading