Add option to verify file checksum of output files#14433
Closed
jaykorean wants to merge 4 commits intofacebook:mainfrom
Closed
Add option to verify file checksum of output files#14433jaykorean wants to merge 4 commits intofacebook:mainfrom
jaykorean wants to merge 4 commits intofacebook:mainfrom
Conversation
✅ clang-tidy: No findings on changed linesCompleted in 386.9s. |
|
@jaykorean has imported this pull request. If you are a Meta employee, you can view this in D95648000. |
58f1056 to
96b452f
Compare
|
@jaykorean has imported this pull request. If you are a Meta employee, you can view this in D95648000. |
Contributor
|
Can we add this to stress/crash test? |
|
@jaykorean has imported this pull request. If you are a Meta employee, you can view this in D95648000. |
|
@jaykorean merged this pull request in 5494bc1. |
xingbowang
added a commit
to xingbowang/rocksdb
that referenced
this pull request
Mar 12, 2026
Summary: Fix a bug where `VerifyOutputFiles()` produces false positive "Key-value checksum of compaction output doesn't match what was computed when written" errors when `verify_output_flags` includes `kVerifyIteration` but `paranoid_file_checks` is false. The root cause is a hash enable flag mismatch between writing and verification: - During compaction writing (`OpenCompactionOutputFile`), the `OutputValidator` hash computation was gated solely by `paranoid_file_checks_`. When false, `enable_hash=false` and the hash stays at 0. - During verification (`VerifyOutputFiles`), a new `OutputValidator` is always created with `enable_hash=true`, computing a non-zero hash. - `CompareValidator()` then compares 0 vs non-zero, producing a false positive corruption. This was exposed by the crash test randomization of `verify_output_flags` added in facebook#14433. Before that change, `verify_output_flags` was always 0, so `kVerifyIteration` was only exercised via `paranoid_file_checks=true` (which correctly enabled the hash during writing). The fix ensures hash computation is enabled during writing whenever either `paranoid_file_checks_` is true OR `verify_output_flags` includes `kVerifyIteration`. Test Plan: - Added `DBCompactionTest.VerifyIterationWithoutParanoidFileChecks` - Added `DBCompactionTest.VerifyAllOutputFlagsWithoutParanoidFileChecks` - Round-trip verified: tests FAIL without fix, PASS with fix
xingbowang
added a commit
to xingbowang/rocksdb
that referenced
this pull request
Mar 12, 2026
Summary: Fix a bug where `VerifyOutputFiles()` produces false positive "Key-value checksum of compaction output doesn't match what was computed when written" errors when `verify_output_flags` includes `kVerifyIteration` but `paranoid_file_checks` is false. The root cause is a hash enable flag mismatch between writing and verification: - During compaction writing (`OpenCompactionOutputFile`), the `OutputValidator` hash computation was gated solely by `paranoid_file_checks_`. When false, `enable_hash=false` and the hash stays at 0. - During verification (`VerifyOutputFiles`), a new `OutputValidator` is always created with `enable_hash=true`, computing a non-zero hash. - `CompareValidator()` then compares 0 vs non-zero, producing a false positive corruption. This was exposed by the crash test randomization of `verify_output_flags` added in facebook#14433. Before that change, `verify_output_flags` was always 0, so `kVerifyIteration` was only exercised via `paranoid_file_checks=true` (which correctly enabled the hash during writing). The fix ensures hash computation is enabled during writing whenever either `paranoid_file_checks_` is true OR `verify_output_flags` includes `kVerifyIteration`. Test Plan: - Added `DBCompactionTest.VerifyIterationWithoutParanoidFileChecks` - Added `DBCompactionTest.VerifyAllOutputFlagsWithoutParanoidFileChecks` - Round-trip verified: tests FAIL without fix, PASS with fix
meta-codesync bot
pushed a commit
that referenced
this pull request
Mar 12, 2026
Summary: Fix a bug where `VerifyOutputFiles()` produces false positive "Key-value checksum of compaction output doesn't match what was computed when written" errors when `verify_output_flags` includes `kVerifyIteration` but `paranoid_file_checks` is false. The root cause is a hash enable flag mismatch between writing and verification: - During compaction writing (`OpenCompactionOutputFile`), the `OutputValidator` hash computation was gated solely by `paranoid_file_checks_`. When false, `enable_hash=false` and the hash stays at 0. - During verification (`VerifyOutputFiles`), a new `OutputValidator` is always created with `enable_hash=true`, computing a non-zero hash. - `CompareValidator()` then compares 0 vs non-zero, producing a false positive corruption. This was exposed by the crash test randomization of `verify_output_flags` added in #14433. Before that change, `verify_output_flags` was always 0, so `kVerifyIteration` was only exercised via `paranoid_file_checks=true` (which correctly enabled the hash during writing). The fix ensures hash computation is enabled during writing whenever either `paranoid_file_checks_` is true OR `verify_output_flags` includes `kVerifyIteration`. Pull Request resolved: #14456 Test Plan: - Added `DBCompactionTest.VerifyIterationWithoutParanoidFileChecks` - Added `DBCompactionTest.VerifyAllOutputFlagsWithoutParanoidFileChecks` - Round-trip verified: tests FAIL without fix, PASS with fix Reviewed By: jaykorean Differential Revision: D96371769 Pulled By: xingbowang fbshipit-source-id: 2f7406327496c7b541e4fa2668894df89eb813e8
Closed
meta-codesync bot
pushed a commit
that referenced
this pull request
Mar 23, 2026
Summary: Fix leaked table cache entries that cause `TEST_VerifyNoObsoleteFilesCached` assertion failure during `DB::Close()` in ASAN crash test builds (T258745630): ``` File 126519 is not live nor quarantined Assertion `cached_file_is_live_or_quar' failed. ``` When a compaction fails *after* `VerifyOutputFiles` succeeds (e.g., at `VerifyCompactionRecordCounts`), the overall `compact_->status` is set to error but each subcompaction's individual `status` remains OK. `SubcompactionState::Cleanup` only checked the individual subcompaction status, so it skipped calling `ReleaseObsolete` on the output files' table cache entries — leaking them. Normally, `Close()`'s backstop (`FindObsoleteFiles(force=true)` + `PurgeObsoleteFiles`) would catch this by finding the orphan file on disk, evicting the cache entry, and deleting the file. However, `FindObsoleteFiles` calls `FaultInjectionTestFS::GetChildren` which can fail under metadata read fault injection (`--open_metadata_read_fault_one_in=8`). The error is silently ignored (`s.PermitUncheckedError()` in db_impl_files.cc:199), so the orphan file is never found and the leaked cache entry is never evicted. The `Cleanup` bug is latent and predates recent changes. PR #14433 added `verify_output_flags` randomization to the crash test, and PR #14456 fixed false-positive corruptions that #14433 caused. Before #14456, `VerifyOutputFiles` would produce false corruption errors that *accidentally prevented the leak* by setting the subcompaction status to non-OK. ### How it triggers **Step 1 — VerifyOutputFiles adds cache entries for compaction output files:** ``` CompactionJob::Run() RunSubcompactions() // all subcompactions succeed // output file 12 written to disk // each sub_compact.status = OK SyncOutputDirectories() // status = OK VerifyOutputFiles() for each output_file: table_cache()->NewIterator(output_file.meta) FindTable() cache->Insert(file_number=12) // <<< ENTRY ADDED TO TABLE CACHE return iterator holding handle delete iter // releases handle, entry stays in LRU // status = OK ``` **Step 2 — A post-verification step fails, overall status set but NOT subcompaction status:** ``` CompactionJob::Run() VerifyCompactionRecordCounts() // returns Status::Corruption(...) FinalizeCompactionRun(status=Corruption) compact_->status = Corruption // <<< OVERALL status = error // each sub_compact.status is still OK! ``` **Step 3 — Install skips InstallCompactionResults (file 12 never enters a Version):** ``` CompactionJob::Install() status = compact_->status // Corruption if (status.ok()) // FALSE InstallCompactionResults() // <<< SKIPPED — file 12 not in any version ``` **Step 4 — CleanupCompaction skips ReleaseObsolete (THE BUG):** ``` CompactionJob::CleanupCompaction() for each sub_compact: sub_compact.Cleanup(table_cache) if (!status.ok()) // checks sub_compact.status = OK // <<< FALSE — individual status is OK! ReleaseObsolete(...) // <<< NEVER CALLED — cache entry leaked! ``` **Step 5 — Metadata read fault prevents backstop from finding the orphan:** `FindObsoleteFiles(force=true)` calls `GetChildren()` to scan the DB directory. `FaultInjectionTestFS::GetChildren` injects a metadata read error (`--open_metadata_read_fault_one_in=8`). The error is silently ignored (`s.PermitUncheckedError()`), so the directory listing is empty and the orphan file is never found. This happens both in the post-compaction cleanup (`BackgroundCallCompaction`) and in `Close()`'s backstop: ``` FindObsoleteFiles(force=true) fs->GetChildren(path, ...) // returns IOError (fault injected) s.PermitUncheckedError(); // error silently ignored // files vector is empty — orphan file 12 not found PurgeObsoleteFiles() // nothing to do — no Evict called ``` **Step 6 — Assertion fires during Close():** ``` CloseHelper() FindObsoleteFiles(force=true) // GetChildren fails again → orphan missed PurgeObsoleteFiles() // nothing to evict TEST_VerifyNoObsoleteFilesCached() for each cache entry: file_number = 12 live_and_quar_files.find(12) == end() // NOT in any version! >>> assert(cached_file_is_live_or_quar) FAILS <<< ``` **Crash test call stack:** ``` frame #9: __assert_fail_base("cached_file_is_live_or_quar", "db_impl_debug.cc", 389) frame #11: DBImpl::TEST_VerifyNoObsoleteFilesCached()::lambda // finds leaked entry frame #18: LRUCacheShard::ApplyToSomeEntries(...) // iterating cache shard frame #19: ShardedCache::ApplyToAllEntries(...) // iterating all shards frame #20: DBImpl::TEST_VerifyNoObsoleteFilesCached() // the verification frame #21: DBImpl::CloseHelper() // during Close frame #22: DBImpl::CloseImpl() frame #23: DBImpl::Close() frame #24: StressTest::Reopen() // crash test reopen frame #25: StressTest::OperateDb() // worker thread ``` ### Fix Pass the overall `compact_->status` to `SubcompactionState::Cleanup` and call `ReleaseObsolete` when *either* the subcompaction status or the overall status is non-OK: ```cpp // Before: if (!status.ok()) { ReleaseObsolete(...); } // After: if (!status.ok() || !overall_status.ok()) { ReleaseObsolete(...); } ``` ## Key changes - **`SubcompactionState::Cleanup`**: Now takes an `overall_status` parameter and calls `ReleaseObsolete` when *either* the subcompaction status or the overall compaction status is non-OK. - **`CompactionJob::CleanupCompaction`**: Passes `compact_->status` (the overall status) to each subcompaction's `Cleanup`. - **Sync point**: Added `CompactionJob::Run():AfterVerifyOutputFiles` for error injection in tests. - **Unit test**: `DBCompactionTest.LeakedTableCacheEntryOnCompactionFailure` uses `FaultInjectionTestFS` to reproduce the crash test scenario — injects error after `VerifyOutputFiles` and deactivates the filesystem so `GetChildren` fails in `FindObsoleteFiles`, preventing the backstop from evicting the leaked cache entry. Pull Request resolved: #14469 Test Plan: - `DBCompactionTest.LeakedTableCacheEntryOnCompactionFailure`: - **Without fix (ASAN)**: assertion fires during `Close()` — `File 12 is not live nor quarantined` - **With fix (ASAN)**: passes — `ReleaseObsolete` properly cleans up the cache entry - **With fix (non-ASAN)**: passes ``` $ COMPILE_WITH_ASAN=1 make -j db_compaction_test $ ./db_compaction_test --gtest_filter="DBCompactionTest.LeakedTableCacheEntryOnCompactionFailure" [ PASSED ] 1 test. ``` Reviewed By: xingbowang Differential Revision: D97190944 Pulled By: joshkang97 fbshipit-source-id: fdfd481cc1e192803cfb7d64052ccb9162c21b94
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
One of the follow ups from #14103. Users will have the option to verify file checksums for all compaction output files before they are installed. This feature helps prevent corrupted SST files from being added to the LSM tree.
Test Plan
Unit test added