Skip to content

Fix table handle leak #14469

Closed
joshkang97 wants to merge 9 commits intofacebook:mainfrom
joshkang97:fix_leaked_files
Closed

Fix table handle leak #14469
joshkang97 wants to merge 9 commits intofacebook:mainfrom
joshkang97:fix_leaked_files

Conversation

@joshkang97
Copy link
Copy Markdown
Contributor

@joshkang97 joshkang97 commented Mar 18, 2026

Summary

Fix leaked table cache entries that cause TEST_VerifyNoObsoleteFilesCached assertion failure during DB::Close() in ASAN crash test builds (T258745630):

File 126519 is not live nor quarantined
Assertion `cached_file_is_live_or_quar' failed.

When a compaction fails after VerifyOutputFiles succeeds (e.g., at VerifyCompactionRecordCounts), the overall compact_->status is set to error but each subcompaction's individual status remains OK. SubcompactionState::Cleanup only checked the individual subcompaction status, so it skipped calling ReleaseObsolete on the output files' table cache entries — leaking them.

Normally, Close()'s backstop (FindObsoleteFiles(force=true) + PurgeObsoleteFiles) would catch this by finding the orphan file on disk, evicting the cache entry, and deleting the file. However, FindObsoleteFiles calls FaultInjectionTestFS::GetChildren which can fail under metadata read fault injection (--open_metadata_read_fault_one_in=8). The error is silently ignored (s.PermitUncheckedError() in db_impl_files.cc:199), so the orphan file is never found and the leaked cache entry is never evicted.

The Cleanup bug is latent and predates recent changes. PR #14433 added verify_output_flags randomization to the crash test, and PR #14456 fixed false-positive corruptions that #14433 caused. Before #14456, VerifyOutputFiles would produce false corruption errors that accidentally prevented the leak by setting the subcompaction status to non-OK.

How it triggers

Step 1 — VerifyOutputFiles adds cache entries for compaction output files:

CompactionJob::Run()
  RunSubcompactions()              // all subcompactions succeed
                                   // output file 12 written to disk
                                   // each sub_compact.status = OK
  SyncOutputDirectories()          // status = OK
  VerifyOutputFiles()
    for each output_file:
      table_cache()->NewIterator(output_file.meta)
        FindTable()
          cache->Insert(file_number=12)   // <<< ENTRY ADDED TO TABLE CACHE
        return iterator holding handle
      delete iter                          // releases handle, entry stays in LRU
    // status = OK

Step 2 — A post-verification step fails, overall status set but NOT subcompaction status:

CompactionJob::Run()
  VerifyCompactionRecordCounts()   // returns Status::Corruption(...)
  FinalizeCompactionRun(status=Corruption)
    compact_->status = Corruption  // <<< OVERALL status = error
    // each sub_compact.status is still OK!

Step 3 — Install skips InstallCompactionResults (file 12 never enters a Version):

CompactionJob::Install()
  status = compact_->status        // Corruption
  if (status.ok())                 // FALSE
    InstallCompactionResults()     // <<< SKIPPED — file 12 not in any version

Step 4 — CleanupCompaction skips ReleaseObsolete (THE BUG):

CompactionJob::CleanupCompaction()
  for each sub_compact:
    sub_compact.Cleanup(table_cache)
      if (!status.ok())            // checks sub_compact.status = OK
                                   // <<< FALSE — individual status is OK!
        ReleaseObsolete(...)       // <<< NEVER CALLED — cache entry leaked!

Step 5 — Metadata read fault prevents backstop from finding the orphan:

FindObsoleteFiles(force=true) calls GetChildren() to scan the DB directory.
FaultInjectionTestFS::GetChildren injects a metadata read error
(--open_metadata_read_fault_one_in=8). The error is silently ignored
(s.PermitUncheckedError()), so the directory listing is empty and the orphan
file is never found. This happens both in the post-compaction cleanup
(BackgroundCallCompaction) and in Close()'s backstop:

FindObsoleteFiles(force=true)
  fs->GetChildren(path, ...)       // returns IOError (fault injected)
  s.PermitUncheckedError();        // error silently ignored
  // files vector is empty — orphan file 12 not found
PurgeObsoleteFiles()               // nothing to do — no Evict called

Step 6 — Assertion fires during Close():

CloseHelper()
  FindObsoleteFiles(force=true)    // GetChildren fails again → orphan missed
  PurgeObsoleteFiles()             // nothing to evict
  TEST_VerifyNoObsoleteFilesCached()
    for each cache entry:
      file_number = 12
      live_and_quar_files.find(12) == end()  // NOT in any version!
      >>> assert(cached_file_is_live_or_quar) FAILS <<<

Crash test call stack:

frame #9:  __assert_fail_base("cached_file_is_live_or_quar", "db_impl_debug.cc", 389)
frame #11: DBImpl::TEST_VerifyNoObsoleteFilesCached()::lambda  // finds leaked entry
frame #18: LRUCacheShard::ApplyToSomeEntries(...)              // iterating cache shard
frame #19: ShardedCache::ApplyToAllEntries(...)                // iterating all shards
frame #20: DBImpl::TEST_VerifyNoObsoleteFilesCached()          // the verification
frame #21: DBImpl::CloseHelper()                               // during Close
frame #22: DBImpl::CloseImpl()
frame #23: DBImpl::Close()
frame #24: StressTest::Reopen()                                // crash test reopen
frame #25: StressTest::OperateDb()                             // worker thread

Fix

Pass the overall compact_->status to SubcompactionState::Cleanup and call
ReleaseObsolete when either the subcompaction status or the overall status
is non-OK:

// Before:
if (!status.ok()) { ReleaseObsolete(...); }

// After:
if (!status.ok() || !overall_status.ok()) { ReleaseObsolete(...); }

Key changes

  • SubcompactionState::Cleanup: Now takes an overall_status parameter and calls ReleaseObsolete when either the subcompaction status or the overall compaction status is non-OK.
  • CompactionJob::CleanupCompaction: Passes compact_->status (the overall status) to each subcompaction's Cleanup.
  • Sync point: Added CompactionJob::Run():AfterVerifyOutputFiles for error injection in tests.
  • Unit test: DBCompactionTest.LeakedTableCacheEntryOnCompactionFailure uses FaultInjectionTestFS to reproduce the crash test scenario — injects error after VerifyOutputFiles and deactivates the filesystem so GetChildren fails in FindObsoleteFiles, preventing the backstop from evicting the leaked cache entry.

Test Plan

  • DBCompactionTest.LeakedTableCacheEntryOnCompactionFailure:
    • Without fix (ASAN): assertion fires during Close()File 12 is not live nor quarantined
    • With fix (ASAN): passes — ReleaseObsolete properly cleans up the cache entry
    • With fix (non-ASAN): passes
$ COMPILE_WITH_ASAN=1 make -j db_compaction_test
$ ./db_compaction_test --gtest_filter="DBCompactionTest.LeakedTableCacheEntryOnCompactionFailure"
[  PASSED  ] 1 test.

@joshkang97 joshkang97 requested a review from pdillinger March 18, 2026 21:49
@meta-cla meta-cla bot added the CLA Signed label Mar 18, 2026
@joshkang97 joshkang97 requested a review from xingbowang March 18, 2026 21:49
@joshkang97 joshkang97 marked this pull request as ready for review March 18, 2026 21:49
@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Mar 18, 2026

@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D97190944.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 18, 2026

✅ clang-tidy: No findings on changed lines

Completed in 287.5s.

Copy link
Copy Markdown
Contributor

@hx235 hx235 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a unit test that failed deterministically before your fix and pass after if such unit test does not exist yet? Crash test is not failing often enough to serve that purpose.

@joshkang97 joshkang97 requested a review from jaykorean March 20, 2026 22:58
@joshkang97 joshkang97 requested a review from hx235 March 20, 2026 23:00
@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Mar 20, 2026

@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D97190944.

@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Mar 21, 2026

@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D97190944.

@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Mar 22, 2026

@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D97190944.

@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Mar 23, 2026

@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D97190944.

@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Mar 23, 2026

@joshkang97 merged this pull request in 3d99378.

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Apr 1, 2026
Summary:
When CompactionJob::Run() succeeds but Install() fails (e.g., LogAndApply
MANIFEST I/O error), compact_->status was never updated with the install
failure. CleanupCompaction() passed the stale OK status to
SubcompactionState::Cleanup(), which skipped ReleaseObsolete -- leaking
table cache entries for output files that were cached by VerifyOutputFiles
but never installed into any Version.

This is the same class of bug fixed in facebook#14469 (where Run() failed after
VerifyOutputFiles), but in the Install() failure path. The
FindObsoleteFiles full-scan backstop would normally catch this, but fails
under crash test metadata read fault injection
(--open_metadata_read_fault_one_in), causing the
TEST_VerifyNoObsoleteFilesCached assertion to fire during Close().

Fix: propagate Install()'s local status back to compact_->status before
CleanupCompaction(), so Cleanup() sees the failure and calls
ReleaseObsolete on the output files.

Test Plan:
New unit test DBCompactionTest.LeakedTableCacheEntryOnInstallFailure:
- Without fix (ASAN): assertion fires -- "File 12 is not live nor quarantined"
- With fix (ASAN): passes -- ReleaseObsolete properly cleans up the entry
```
COMPILE_WITH_ASAN=1 make -j db_compaction_test
./db_compaction_test --gtest_filter="DBCompactionTest.LeakedTableCacheEntry*"
[  PASSED  ] 2 tests.
```

Tasks: T218515781
meta-codesync bot pushed a commit that referenced this pull request Apr 2, 2026
Summary:
When CompactionJob::Run() succeeds but Install() fails (e.g., LogAndApply MANIFEST I/O error), compact_->status was never updated with the install failure. CleanupCompaction() passed the stale OK status to SubcompactionState::Cleanup(), which skipped ReleaseObsolete -- leaking table cache entries for output files that were cached by VerifyOutputFiles but never installed into any Version.

This is the same class of bug fixed in #14469 (where Run() failed after VerifyOutputFiles), but in the Install() failure path. The FindObsoleteFiles full-scan backstop would normally catch this, but fails under crash test metadata read fault injection
(--open_metadata_read_fault_one_in), causing the
TEST_VerifyNoObsoleteFilesCached assertion to fire during Close().

Fix: propagate Install()'s local status back to compact_->status before CleanupCompaction(), so Cleanup() sees the failure and calls ReleaseObsolete on the output files.

Pull Request resolved: #14549

Test Plan:
New unit test DBCompactionTest.LeakedTableCacheEntryOnInstallFailure:
- Without fix (ASAN): assertion fires -- "File 12 is not live nor quarantined"
- With fix (ASAN): passes -- ReleaseObsolete properly cleans up the entry
```
COMPILE_WITH_ASAN=1 make -j db_compaction_test
./db_compaction_test --gtest_filter="DBCompactionTest.LeakedTableCacheEntry*"
[  PASSED  ] 2 tests.
```

Reviewed By: joshkang97

Differential Revision: D99155908

Pulled By: pdillinger

fbshipit-source-id: ed5374a38d7903866a38a0fe0f5539e12321bc84
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants