Fix false positive corruption in compaction output verification#14456
Closed
xingbowang wants to merge 1 commit intofacebook:mainfrom
Closed
Fix false positive corruption in compaction output verification#14456xingbowang wants to merge 1 commit intofacebook:mainfrom
xingbowang wants to merge 1 commit intofacebook:mainfrom
Conversation
✅ clang-tidy: No findings on changed linesCompleted in 275.6s. |
jaykorean
approved these changes
Mar 12, 2026
Contributor
jaykorean
left a comment
There was a problem hiding this comment.
Left one nit comment. Thanks for fixing this!
db/compaction/compaction_job.cc
Outdated
| for (size_t i = 0; i < output_files.size(); i++) { | ||
| FileMetaData file_copy = output_files[i]; | ||
|
|
||
| bool enable_output_hash = |
Contributor
There was a problem hiding this comment.
nit: this bool can be checked outside of for loop and can be const
Contributor
Author
There was a problem hiding this comment.
Good catch. Thank you.
Summary: Fix a bug where `VerifyOutputFiles()` produces false positive "Key-value checksum of compaction output doesn't match what was computed when written" errors when `verify_output_flags` includes `kVerifyIteration` but `paranoid_file_checks` is false. The root cause is a hash enable flag mismatch between writing and verification: - During compaction writing (`OpenCompactionOutputFile`), the `OutputValidator` hash computation was gated solely by `paranoid_file_checks_`. When false, `enable_hash=false` and the hash stays at 0. - During verification (`VerifyOutputFiles`), a new `OutputValidator` is always created with `enable_hash=true`, computing a non-zero hash. - `CompareValidator()` then compares 0 vs non-zero, producing a false positive corruption. This was exposed by the crash test randomization of `verify_output_flags` added in facebook#14433. Before that change, `verify_output_flags` was always 0, so `kVerifyIteration` was only exercised via `paranoid_file_checks=true` (which correctly enabled the hash during writing). The fix ensures hash computation is enabled during writing whenever either `paranoid_file_checks_` is true OR `verify_output_flags` includes `kVerifyIteration`. Test Plan: - Added `DBCompactionTest.VerifyIterationWithoutParanoidFileChecks` - Added `DBCompactionTest.VerifyAllOutputFlagsWithoutParanoidFileChecks` - Round-trip verified: tests FAIL without fix, PASS with fix
cd98976 to
3a41cb7
Compare
|
@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D96371769. |
|
@xingbowang merged this pull request in e44a99f. |
Closed
meta-codesync bot
pushed a commit
that referenced
this pull request
Mar 23, 2026
Summary: Fix leaked table cache entries that cause `TEST_VerifyNoObsoleteFilesCached` assertion failure during `DB::Close()` in ASAN crash test builds (T258745630): ``` File 126519 is not live nor quarantined Assertion `cached_file_is_live_or_quar' failed. ``` When a compaction fails *after* `VerifyOutputFiles` succeeds (e.g., at `VerifyCompactionRecordCounts`), the overall `compact_->status` is set to error but each subcompaction's individual `status` remains OK. `SubcompactionState::Cleanup` only checked the individual subcompaction status, so it skipped calling `ReleaseObsolete` on the output files' table cache entries — leaking them. Normally, `Close()`'s backstop (`FindObsoleteFiles(force=true)` + `PurgeObsoleteFiles`) would catch this by finding the orphan file on disk, evicting the cache entry, and deleting the file. However, `FindObsoleteFiles` calls `FaultInjectionTestFS::GetChildren` which can fail under metadata read fault injection (`--open_metadata_read_fault_one_in=8`). The error is silently ignored (`s.PermitUncheckedError()` in db_impl_files.cc:199), so the orphan file is never found and the leaked cache entry is never evicted. The `Cleanup` bug is latent and predates recent changes. PR #14433 added `verify_output_flags` randomization to the crash test, and PR #14456 fixed false-positive corruptions that #14433 caused. Before #14456, `VerifyOutputFiles` would produce false corruption errors that *accidentally prevented the leak* by setting the subcompaction status to non-OK. ### How it triggers **Step 1 — VerifyOutputFiles adds cache entries for compaction output files:** ``` CompactionJob::Run() RunSubcompactions() // all subcompactions succeed // output file 12 written to disk // each sub_compact.status = OK SyncOutputDirectories() // status = OK VerifyOutputFiles() for each output_file: table_cache()->NewIterator(output_file.meta) FindTable() cache->Insert(file_number=12) // <<< ENTRY ADDED TO TABLE CACHE return iterator holding handle delete iter // releases handle, entry stays in LRU // status = OK ``` **Step 2 — A post-verification step fails, overall status set but NOT subcompaction status:** ``` CompactionJob::Run() VerifyCompactionRecordCounts() // returns Status::Corruption(...) FinalizeCompactionRun(status=Corruption) compact_->status = Corruption // <<< OVERALL status = error // each sub_compact.status is still OK! ``` **Step 3 — Install skips InstallCompactionResults (file 12 never enters a Version):** ``` CompactionJob::Install() status = compact_->status // Corruption if (status.ok()) // FALSE InstallCompactionResults() // <<< SKIPPED — file 12 not in any version ``` **Step 4 — CleanupCompaction skips ReleaseObsolete (THE BUG):** ``` CompactionJob::CleanupCompaction() for each sub_compact: sub_compact.Cleanup(table_cache) if (!status.ok()) // checks sub_compact.status = OK // <<< FALSE — individual status is OK! ReleaseObsolete(...) // <<< NEVER CALLED — cache entry leaked! ``` **Step 5 — Metadata read fault prevents backstop from finding the orphan:** `FindObsoleteFiles(force=true)` calls `GetChildren()` to scan the DB directory. `FaultInjectionTestFS::GetChildren` injects a metadata read error (`--open_metadata_read_fault_one_in=8`). The error is silently ignored (`s.PermitUncheckedError()`), so the directory listing is empty and the orphan file is never found. This happens both in the post-compaction cleanup (`BackgroundCallCompaction`) and in `Close()`'s backstop: ``` FindObsoleteFiles(force=true) fs->GetChildren(path, ...) // returns IOError (fault injected) s.PermitUncheckedError(); // error silently ignored // files vector is empty — orphan file 12 not found PurgeObsoleteFiles() // nothing to do — no Evict called ``` **Step 6 — Assertion fires during Close():** ``` CloseHelper() FindObsoleteFiles(force=true) // GetChildren fails again → orphan missed PurgeObsoleteFiles() // nothing to evict TEST_VerifyNoObsoleteFilesCached() for each cache entry: file_number = 12 live_and_quar_files.find(12) == end() // NOT in any version! >>> assert(cached_file_is_live_or_quar) FAILS <<< ``` **Crash test call stack:** ``` frame #9: __assert_fail_base("cached_file_is_live_or_quar", "db_impl_debug.cc", 389) frame #11: DBImpl::TEST_VerifyNoObsoleteFilesCached()::lambda // finds leaked entry frame #18: LRUCacheShard::ApplyToSomeEntries(...) // iterating cache shard frame #19: ShardedCache::ApplyToAllEntries(...) // iterating all shards frame #20: DBImpl::TEST_VerifyNoObsoleteFilesCached() // the verification frame #21: DBImpl::CloseHelper() // during Close frame #22: DBImpl::CloseImpl() frame #23: DBImpl::Close() frame #24: StressTest::Reopen() // crash test reopen frame #25: StressTest::OperateDb() // worker thread ``` ### Fix Pass the overall `compact_->status` to `SubcompactionState::Cleanup` and call `ReleaseObsolete` when *either* the subcompaction status or the overall status is non-OK: ```cpp // Before: if (!status.ok()) { ReleaseObsolete(...); } // After: if (!status.ok() || !overall_status.ok()) { ReleaseObsolete(...); } ``` ## Key changes - **`SubcompactionState::Cleanup`**: Now takes an `overall_status` parameter and calls `ReleaseObsolete` when *either* the subcompaction status or the overall compaction status is non-OK. - **`CompactionJob::CleanupCompaction`**: Passes `compact_->status` (the overall status) to each subcompaction's `Cleanup`. - **Sync point**: Added `CompactionJob::Run():AfterVerifyOutputFiles` for error injection in tests. - **Unit test**: `DBCompactionTest.LeakedTableCacheEntryOnCompactionFailure` uses `FaultInjectionTestFS` to reproduce the crash test scenario — injects error after `VerifyOutputFiles` and deactivates the filesystem so `GetChildren` fails in `FindObsoleteFiles`, preventing the backstop from evicting the leaked cache entry. Pull Request resolved: #14469 Test Plan: - `DBCompactionTest.LeakedTableCacheEntryOnCompactionFailure`: - **Without fix (ASAN)**: assertion fires during `Close()` — `File 12 is not live nor quarantined` - **With fix (ASAN)**: passes — `ReleaseObsolete` properly cleans up the cache entry - **With fix (non-ASAN)**: passes ``` $ COMPILE_WITH_ASAN=1 make -j db_compaction_test $ ./db_compaction_test --gtest_filter="DBCompactionTest.LeakedTableCacheEntryOnCompactionFailure" [ PASSED ] 1 test. ``` Reviewed By: xingbowang Differential Revision: D97190944 Pulled By: joshkang97 fbshipit-source-id: fdfd481cc1e192803cfb7d64052ccb9162c21b94
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
Fix a bug where
VerifyOutputFiles()produces false positive "Key-value checksum of compaction output doesn't match what was computed when written" errors whenverify_output_flagsincludeskVerifyIterationbutparanoid_file_checksis false.The root cause is a hash enable flag mismatch between writing and verification:
OpenCompactionOutputFile), theOutputValidatorhash computation was gated solely byparanoid_file_checks_. When false,enable_hash=falseand the hash stays at 0.VerifyOutputFiles), a newOutputValidatoris always created withenable_hash=true, computing a non-zero hash.CompareValidator()then compares 0 vs non-zero, producing a false positive corruption.This was exposed by the crash test randomization of
verify_output_flagsadded in #14433. Before that change,verify_output_flagswas always 0, sokVerifyIterationwas only exercised viaparanoid_file_checks=true(which correctly enabled the hash during writing).The fix ensures hash computation is enabled during writing whenever either
paranoid_file_checks_is true ORverify_output_flagsincludeskVerifyIteration.Test Plan:
DBCompactionTest.VerifyIterationWithoutParanoidFileChecksDBCompactionTest.VerifyAllOutputFlagsWithoutParanoidFileChecks