
Deflake unit test DBErrorHandlingFSTest.AtomicFlushNoSpaceError #13234

Closed
cbi42 wants to merge 1 commit into facebook:main from cbi42:fix-resume-db-access

Conversation

@cbi42 (Contributor) commented Dec 19, 2024

Summary: `DBErrorHandlingFSTest.AtomicFlushNoSpaceError` is flaky due to a seg fault during error recovery:

```
...
frame #5: 0x00007f0b3ea0a9d6 librocksdb.so.9.10`rocksdb::VersionSet::GetObsoleteFiles(std::vector<rocksdb::ObsoleteFileInfo, std::allocator<rocksdb::ObsoleteFileInfo>>*, std::vector<rocksdb::ObsoleteBlobFileInfo, std::allocator<rocksdb::ObsoleteBlobFileInfo>>*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>*, unsigned long) [inlined] std::vector<rocksdb::ObsoleteFileInfo, std::allocator<rocksdb::ObsoleteFileInfo>>::begin(this=<unavailable>) at stl_vector.h:812:16
frame #6: 0x00007f0b3ea0a9d6 librocksdb.so.9.10`rocksdb::VersionSet::GetObsoleteFiles(this=0x0000000000000000, files=size=0, blob_files=size=0, manifest_filenames=size=0, min_pending_output=18446744073709551615) at version_set.cc:7258:18
frame #7: 0x00007f0b3e8ccbc0 librocksdb.so.9.10`rocksdb::DBImpl::FindObsoleteFiles(this=<unavailable>, job_context=<unavailable>, force=<unavailable>, no_full_scan=<unavailable>) at db_impl_files.cc:162:30
frame #8: 0x00007f0b3e85e698 librocksdb.so.9.10`rocksdb::DBImpl::ResumeImpl(this=<unavailable>, context=<unavailable>) at db_impl.cc:434:20
frame #9: 0x00007f0b3e921516 librocksdb.so.9.10`rocksdb::ErrorHandler::RecoverFromBGError(this=<unavailable>, is_manual=<unavailable>) at error_handler.cc:632:46
```

I suspect this is due to the DB being destructed and reopened during recovery. Specifically, the `ClearBGError()` call can release and re-acquire the DB mutex, and the DB can be closed during that window, so it is not safe to access DB state after `ClearBGError()` returns. There was a similar story in #9496. Moving the obsolete-files logic after `ClearBGError()` (#11955) probably made the seg fault easier to trigger.
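
To make the race concrete, here is a minimal C++ sketch of the unsafe pattern. Everything here is a simplified stand-in (`DbSketch`, `mu`, `versions`) for `DBImpl::mutex_`, `DBImpl::versions_`, and the recovery listener callbacks, not the actual RocksDB code:

```cpp
#include <mutex>

// Simplified stand-ins for RocksDB types; illustrative only.
struct VersionSet {
  void GetObsoleteFiles() { /* walks internal vectors */ }
};

struct DbSketch {
  std::mutex mu;                           // stands in for DBImpl::mutex_
  VersionSet* versions = new VersionSet();

  // ClearBGError() has to drop the mutex to run listener callbacks,
  // then re-acquires it before returning.
  void ClearBGError() {
    mu.unlock();
    // ... listener callbacks run here without the mutex held;
    // a concurrent Close() can acquire mu, free state, and finish ...
    mu.lock();
  }

  void Close() {
    std::lock_guard<std::mutex> guard(mu);
    delete versions;
    versions = nullptr;
  }

  void ResumeImpl() {
    mu.lock();
    ClearBGError();
    // BUG: versions may be null here if Close() ran while the mutex was
    // released inside ClearBGError() -- this matches frame #6 above,
    // where GetObsoleteFiles() is entered with this == nullptr.
    versions->GetObsoleteFiles();
    mu.unlock();
  }
};
```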

This PR updates `ClearBGError()` to guarantee that DB close cannot finish until the method has returned and the mutex has been released, so that DB state can safely be accessed after calling it.
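
One way to picture that guarantee is a sketch under assumed mechanics: the `recovery_in_prog` flag and condition-variable handshake below are illustrative, not necessarily the members the actual patch in cc30226 uses. Recovery is marked in progress before the mutex is dropped, and `Close()` waits until the mark is cleared:

```cpp
#include <condition_variable>
#include <mutex>

// Continuation of the sketch above with the fix applied; all names are
// hypothetical, not the exact RocksDB members.
struct DbSketchFixed {
  std::mutex mu;
  std::condition_variable cv;
  bool recovery_in_prog = false;

  void ClearBGError() {
    // Caller holds mu. Mark recovery in progress before dropping the
    // mutex so a concurrent Close() knows it must wait.
    recovery_in_prog = true;
    mu.unlock();
    // ... listener callbacks run here without the mutex held ...
    mu.lock();
    recovery_in_prog = false;
    cv.notify_all();  // wake a Close() that is waiting on us
  }

  void Close() {
    std::unique_lock<std::mutex> lock(mu);
    // Close cannot finish while ClearBGError() is mid-flight.
    cv.wait(lock, [this] { return !recovery_in_prog; });
    // ... tear down DB state under the mutex ...
  }
};
```

Because the recovery thread re-acquires the mutex before clearing the flag and keeps holding it through the rest of `ResumeImpl()`, `Close()` cannot tear anything down until the accesses after `ClearBGError()` are done.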

Test plan: I could not trigger the seg fault locally; I will just monitor future test failures.

@facebook-github-bot (Contributor)

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

cbi42 requested review from hx235 and jowlyzhang on December 19, 2024 at 21:52
@jowlyzhang (Contributor) left a comment:


Thanks for the fix. Nice debugging!

@facebook-github-bot (Contributor)

@cbi42 merged this pull request in cc30226.

ybtsdst pushed a commit to ybtsdst/rocksdb that referenced this pull request on Apr 27, 2025:

Deflake unit test DBErrorHandlingFSTest.AtomicFlushNoSpaceError (facebook#13234)

Pull Request resolved: facebook#13234

Reviewed By: jowlyzhang

Differential Revision: D67476836

Pulled By: cbi42

fbshipit-source-id: dfb3e9ccd4eb3d43fc596ec10e4052861eeec002