Skip to content

Fix false positive corruption in compaction output verification#14456

Closed
xingbowang wants to merge 1 commit intofacebook:mainfrom
xingbowang:2026_03_12_file_cksum
Closed

Fix false positive corruption in compaction output verification#14456
xingbowang wants to merge 1 commit intofacebook:mainfrom
xingbowang:2026_03_12_file_cksum

Conversation

@xingbowang
Copy link
Copy Markdown
Contributor

Summary:
Fix a bug where VerifyOutputFiles() produces false positive "Key-value checksum of compaction output doesn't match what was computed when written" errors when verify_output_flags includes kVerifyIteration but paranoid_file_checks is false.

The root cause is a hash enable flag mismatch between writing and verification:

  • During compaction writing (OpenCompactionOutputFile), the OutputValidator hash computation was gated solely by paranoid_file_checks_. When false, enable_hash=false and the hash stays at 0.
  • During verification (VerifyOutputFiles), a new OutputValidator is always created with enable_hash=true, computing a non-zero hash.
  • CompareValidator() then compares 0 vs non-zero, producing a false positive corruption.

This was exposed by the crash test randomization of verify_output_flags added in #14433. Before that change, verify_output_flags was always 0, so kVerifyIteration was only exercised via paranoid_file_checks=true (which correctly enabled the hash during writing).

The fix ensures hash computation is enabled during writing whenever either paranoid_file_checks_ is true OR verify_output_flags includes kVerifyIteration.

Test Plan:

  • Added DBCompactionTest.VerifyIterationWithoutParanoidFileChecks
  • Added DBCompactionTest.VerifyAllOutputFlagsWithoutParanoidFileChecks
  • Round-trip verified: tests FAIL without fix, PASS with fix

@meta-cla meta-cla bot added the CLA Signed label Mar 12, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 12, 2026

✅ clang-tidy: No findings on changed lines

Completed in 275.6s.

Copy link
Copy Markdown
Contributor

@jaykorean jaykorean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one nit comment. Thanks for fixing this!

for (size_t i = 0; i < output_files.size(); i++) {
FileMetaData file_copy = output_files[i];

bool enable_output_hash =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this bool can be checked outside of for loop and can be const

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Thank you.

Summary:
Fix a bug where `VerifyOutputFiles()` produces false positive "Key-value
checksum of compaction output doesn't match what was computed when written"
errors when `verify_output_flags` includes `kVerifyIteration` but
`paranoid_file_checks` is false.

The root cause is a hash enable flag mismatch between writing and
verification:

- During compaction writing (`OpenCompactionOutputFile`), the
  `OutputValidator` hash computation was gated solely by
  `paranoid_file_checks_`. When false, `enable_hash=false` and the hash
  stays at 0.
- During verification (`VerifyOutputFiles`), a new `OutputValidator` is
  always created with `enable_hash=true`, computing a non-zero hash.
- `CompareValidator()` then compares 0 vs non-zero, producing a false
  positive corruption.

This was exposed by the crash test randomization of `verify_output_flags`
added in facebook#14433. Before that change, `verify_output_flags` was always 0,
so `kVerifyIteration` was only exercised via `paranoid_file_checks=true`
(which correctly enabled the hash during writing).

The fix ensures hash computation is enabled during writing whenever either
`paranoid_file_checks_` is true OR `verify_output_flags` includes
`kVerifyIteration`.

Test Plan:
- Added `DBCompactionTest.VerifyIterationWithoutParanoidFileChecks`
- Added `DBCompactionTest.VerifyAllOutputFlagsWithoutParanoidFileChecks`
- Round-trip verified: tests FAIL without fix, PASS with fix
@xingbowang xingbowang force-pushed the 2026_03_12_file_cksum branch from cd98976 to 3a41cb7 Compare March 12, 2026 21:13
@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Mar 12, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D96371769.

@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Mar 12, 2026

@xingbowang merged this pull request in e44a99f.

meta-codesync bot pushed a commit that referenced this pull request Mar 23, 2026
Summary:
Fix leaked table cache entries that cause `TEST_VerifyNoObsoleteFilesCached` assertion failure during `DB::Close()` in ASAN crash test builds (T258745630):
```
File 126519 is not live nor quarantined
Assertion `cached_file_is_live_or_quar' failed.
```

When a compaction fails *after* `VerifyOutputFiles` succeeds (e.g., at `VerifyCompactionRecordCounts`), the overall `compact_->status` is set to error but each subcompaction's individual `status` remains OK. `SubcompactionState::Cleanup` only checked the individual subcompaction status, so it skipped calling `ReleaseObsolete` on the output files' table cache entries — leaking them.

Normally, `Close()`'s backstop (`FindObsoleteFiles(force=true)` + `PurgeObsoleteFiles`) would catch this by finding the orphan file on disk, evicting the cache entry, and deleting the file. However, `FindObsoleteFiles` calls `FaultInjectionTestFS::GetChildren` which can fail under metadata read fault injection (`--open_metadata_read_fault_one_in=8`). The error is silently ignored (`s.PermitUncheckedError()` in db_impl_files.cc:199), so the orphan file is never found and the leaked cache entry is never evicted.

The `Cleanup` bug is latent and predates recent changes. PR #14433 added `verify_output_flags` randomization to the crash test, and PR #14456 fixed false-positive corruptions that #14433 caused. Before #14456, `VerifyOutputFiles` would produce false corruption errors that *accidentally prevented the leak* by setting the subcompaction status to non-OK.

### How it triggers

**Step 1 — VerifyOutputFiles adds cache entries for compaction output files:**
```
CompactionJob::Run()
  RunSubcompactions()              // all subcompactions succeed
                                   // output file 12 written to disk
                                   // each sub_compact.status = OK
  SyncOutputDirectories()          // status = OK
  VerifyOutputFiles()
    for each output_file:
      table_cache()->NewIterator(output_file.meta)
        FindTable()
          cache->Insert(file_number=12)   // <<< ENTRY ADDED TO TABLE CACHE
        return iterator holding handle
      delete iter                          // releases handle, entry stays in LRU
    // status = OK
```

**Step 2 — A post-verification step fails, overall status set but NOT subcompaction status:**
```
CompactionJob::Run()
  VerifyCompactionRecordCounts()   // returns Status::Corruption(...)
  FinalizeCompactionRun(status=Corruption)
    compact_->status = Corruption  // <<< OVERALL status = error
    // each sub_compact.status is still OK!
```

**Step 3 — Install skips InstallCompactionResults (file 12 never enters a Version):**
```
CompactionJob::Install()
  status = compact_->status        // Corruption
  if (status.ok())                 // FALSE
    InstallCompactionResults()     // <<< SKIPPED — file 12 not in any version
```

**Step 4 — CleanupCompaction skips ReleaseObsolete (THE BUG):**
```
CompactionJob::CleanupCompaction()
  for each sub_compact:
    sub_compact.Cleanup(table_cache)
      if (!status.ok())            // checks sub_compact.status = OK
                                   // <<< FALSE — individual status is OK!
        ReleaseObsolete(...)       // <<< NEVER CALLED — cache entry leaked!
```

**Step 5 — Metadata read fault prevents backstop from finding the orphan:**

`FindObsoleteFiles(force=true)` calls `GetChildren()` to scan the DB directory.
`FaultInjectionTestFS::GetChildren` injects a metadata read error
(`--open_metadata_read_fault_one_in=8`). The error is silently ignored
(`s.PermitUncheckedError()`), so the directory listing is empty and the orphan
file is never found. This happens both in the post-compaction cleanup
(`BackgroundCallCompaction`) and in `Close()`'s backstop:
```
FindObsoleteFiles(force=true)
  fs->GetChildren(path, ...)       // returns IOError (fault injected)
  s.PermitUncheckedError();        // error silently ignored
  // files vector is empty — orphan file 12 not found
PurgeObsoleteFiles()               // nothing to do — no Evict called
```

**Step 6 — Assertion fires during Close():**
```
CloseHelper()
  FindObsoleteFiles(force=true)    // GetChildren fails again → orphan missed
  PurgeObsoleteFiles()             // nothing to evict
  TEST_VerifyNoObsoleteFilesCached()
    for each cache entry:
      file_number = 12
      live_and_quar_files.find(12) == end()  // NOT in any version!
      >>> assert(cached_file_is_live_or_quar) FAILS <<<
```

**Crash test call stack:**
```
frame #9:  __assert_fail_base("cached_file_is_live_or_quar", "db_impl_debug.cc", 389)
frame #11: DBImpl::TEST_VerifyNoObsoleteFilesCached()::lambda  // finds leaked entry
frame #18: LRUCacheShard::ApplyToSomeEntries(...)              // iterating cache shard
frame #19: ShardedCache::ApplyToAllEntries(...)                // iterating all shards
frame #20: DBImpl::TEST_VerifyNoObsoleteFilesCached()          // the verification
frame #21: DBImpl::CloseHelper()                               // during Close
frame #22: DBImpl::CloseImpl()
frame #23: DBImpl::Close()
frame #24: StressTest::Reopen()                                // crash test reopen
frame #25: StressTest::OperateDb()                             // worker thread
```

### Fix

Pass the overall `compact_->status` to `SubcompactionState::Cleanup` and call
`ReleaseObsolete` when *either* the subcompaction status or the overall status
is non-OK:
```cpp
// Before:
if (!status.ok()) { ReleaseObsolete(...); }

// After:
if (!status.ok() || !overall_status.ok()) { ReleaseObsolete(...); }
```

## Key changes

- **`SubcompactionState::Cleanup`**: Now takes an `overall_status` parameter and calls `ReleaseObsolete` when *either* the subcompaction status or the overall compaction status is non-OK.
- **`CompactionJob::CleanupCompaction`**: Passes `compact_->status` (the overall status) to each subcompaction's `Cleanup`.
- **Sync point**: Added `CompactionJob::Run():AfterVerifyOutputFiles` for error injection in tests.
- **Unit test**: `DBCompactionTest.LeakedTableCacheEntryOnCompactionFailure` uses `FaultInjectionTestFS` to reproduce the crash test scenario — injects error after `VerifyOutputFiles` and deactivates the filesystem so `GetChildren` fails in `FindObsoleteFiles`, preventing the backstop from evicting the leaked cache entry.

Pull Request resolved: #14469

Test Plan:
- `DBCompactionTest.LeakedTableCacheEntryOnCompactionFailure`:
  - **Without fix (ASAN)**: assertion fires during `Close()` — `File 12 is not live nor quarantined`
  - **With fix (ASAN)**: passes — `ReleaseObsolete` properly cleans up the cache entry
  - **With fix (non-ASAN)**: passes
```
$ COMPILE_WITH_ASAN=1 make -j db_compaction_test
$ ./db_compaction_test --gtest_filter="DBCompactionTest.LeakedTableCacheEntryOnCompactionFailure"
[  PASSED  ] 1 test.
```

Reviewed By: xingbowang

Differential Revision: D97190944

Pulled By: joshkang97

fbshipit-source-id: fdfd481cc1e192803cfb7d64052ccb9162c21b94
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants