Skip to content

Fix table cache leak when InstallCompactionResults fails#14549

Closed
pdillinger wants to merge 2 commits intofacebook:mainfrom
pdillinger:cached_file_is_live_or_quar_fix
Closed

Fix table cache leak when InstallCompactionResults fails#14549
pdillinger wants to merge 2 commits intofacebook:mainfrom
pdillinger:cached_file_is_live_or_quar_fix

Conversation

@pdillinger
Copy link
Copy Markdown
Contributor

Summary:
When CompactionJob::Run() succeeds but Install() fails (e.g., LogAndApply MANIFEST I/O error), compact_->status was never updated with the install failure. CleanupCompaction() passed the stale OK status to SubcompactionState::Cleanup(), which skipped ReleaseObsolete -- leaking table cache entries for output files that were cached by VerifyOutputFiles but never installed into any Version.

This is the same class of bug fixed in #14469 (where Run() failed after VerifyOutputFiles), but in the Install() failure path. The FindObsoleteFiles full-scan backstop would normally catch this, but fails under crash test metadata read fault injection
(--open_metadata_read_fault_one_in), causing the
TEST_VerifyNoObsoleteFilesCached assertion to fire during Close().

Fix: propagate Install()'s local status back to compact_->status before CleanupCompaction(), so Cleanup() sees the failure and calls ReleaseObsolete on the output files.

Test Plan:
New unit test DBCompactionTest.LeakedTableCacheEntryOnInstallFailure:

  • Without fix (ASAN): assertion fires -- "File 12 is not live nor quarantined"
  • With fix (ASAN): passes -- ReleaseObsolete properly cleans up the entry
COMPILE_WITH_ASAN=1 make -j db_compaction_test
./db_compaction_test --gtest_filter="DBCompactionTest.LeakedTableCacheEntry*"
[  PASSED  ] 2 tests.

Tasks: T218515781

Summary:
When CompactionJob::Run() succeeds but Install() fails (e.g., LogAndApply
MANIFEST I/O error), compact_->status was never updated with the install
failure. CleanupCompaction() passed the stale OK status to
SubcompactionState::Cleanup(), which skipped ReleaseObsolete -- leaking
table cache entries for output files that were cached by VerifyOutputFiles
but never installed into any Version.

This is the same class of bug fixed in facebook#14469 (where Run() failed after
VerifyOutputFiles), but in the Install() failure path. The
FindObsoleteFiles full-scan backstop would normally catch this, but fails
under crash test metadata read fault injection
(--open_metadata_read_fault_one_in), causing the
TEST_VerifyNoObsoleteFilesCached assertion to fire during Close().

Fix: propagate Install()'s local status back to compact_->status before
CleanupCompaction(), so Cleanup() sees the failure and calls
ReleaseObsolete on the output files.

Test Plan:
New unit test DBCompactionTest.LeakedTableCacheEntryOnInstallFailure:
- Without fix (ASAN): assertion fires -- "File 12 is not live nor quarantined"
- With fix (ASAN): passes -- ReleaseObsolete properly cleans up the entry
```
COMPILE_WITH_ASAN=1 make -j db_compaction_test
./db_compaction_test --gtest_filter="DBCompactionTest.LeakedTableCacheEntry*"
[  PASSED  ] 2 tests.
```

Tasks: T218515781
@pdillinger pdillinger requested a review from joshkang97 April 1, 2026 20:49
@meta-cla meta-cla bot added the CLA Signed label Apr 1, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Apr 1, 2026

@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D99155908.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 1, 2026

✅ clang-tidy: No findings on changed lines

Completed in 369.6s.

// (local status = error), Cleanup would see overall_status = OK and skip
// ReleaseObsolete, leaking table cache entries for output files that were
// never installed into any Version.
compact_->status = status;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2026-04-01T21:04:35.3201315Z 0x00007f77c200742f in __GI___wait4 (pid=3939747, stat_loc=0x7f77c18f01ec, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
2026-04-01T21:04:35.3202548Z 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
2026-04-01T21:04:35.3203939Z #4  0x00005595e5c1dec8 in alternative_rocksdb_ns::CompactionState::~CompactionState (this=0x7f77bc008060, __in_chrg=<optimized out>) at ./db/compaction/compaction_state.h:23
2026-04-01T21:04:35.3205165Z 23	class CompactionState {
2026-04-01T21:04:35.3206035Z #5  alternative_rocksdb_ns::CompactionJob::CleanupCompaction (this=this@entry=0x7f77c18f13f0) at db/compaction/compaction_job.cc:2567

Seems like status check is missing or needs to be ignored

@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Apr 2, 2026

@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D99155908.

@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Apr 2, 2026

@pdillinger merged this pull request in abc1f7c.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants