Deflake unit test DBErrorHandlingFSTest.AtomicFlushNoSpaceError #13234
Closed
cbi42 wants to merge 1 commit into facebook:main
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
jowlyzhang (Contributor) approved these changes on Dec 20, 2024
jowlyzhang left a comment:
Thanks for the fix. Nice debugging!
ybtsdst pushed a commit to ybtsdst/rocksdb that referenced this pull request on Apr 27, 2025
…cebook#13234)
Summary: `DBErrorHandlingFSTest.AtomicFlushNoSpaceError` is flaky due to seg fault during error recovery:
```
...
frame #5: 0x00007f0b3ea0a9d6 librocksdb.so.9.10`rocksdb::VersionSet::GetObsoleteFiles(std::vector<rocksdb::ObsoleteFileInfo, std::allocator<rocksdb::ObsoleteFileInfo>>*, std::vector<rocksdb::ObsoleteBlobFileInfo, std::allocator<rocksdb::ObsoleteBlobFileInfo>>*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>*, unsigned long) [inlined] std::vector<rocksdb::ObsoleteFileInfo, std::allocator<rocksdb::ObsoleteFileInfo>>::begin(this=<unavailable>) at stl_vector.h:812:16
frame #6: 0x00007f0b3ea0a9d6 librocksdb.so.9.10`rocksdb::VersionSet::GetObsoleteFiles(this=0x0000000000000000, files=size=0, blob_files=size=0, manifest_filenames=size=0, min_pending_output=18446744073709551615) at version_set.cc:7258:18
frame #7: 0x00007f0b3e8ccbc0 librocksdb.so.9.10`rocksdb::DBImpl::FindObsoleteFiles(this=<unavailable>, job_context=<unavailable>, force=<unavailable>, no_full_scan=<unavailable>) at db_impl_files.cc:162:30
frame #8: 0x00007f0b3e85e698 librocksdb.so.9.10`rocksdb::DBImpl::ResumeImpl(this=<unavailable>, context=<unavailable>) at db_impl.cc:434:20
frame #9: 0x00007f0b3e921516 librocksdb.so.9.10`rocksdb::ErrorHandler::RecoverFromBGError(this=<unavailable>, is_manual=<unavailable>) at error_handler.cc:632:46
```
I suspect this is due to DB being destructed and reopened during recovery. Specifically, the [ClearBGError() call](https://github.com/facebook/rocksdb/blob/c72e79a262bf696faf5f8becabf92374fc14b464/db/db_impl/db_impl.cc#L425) can release and reacquire mutex, and DB can be closed during this time. So it's not safe to access DB state after ClearBGError(). There was a similar story in facebook#9496. [Moving the obsolete files logic after ClearBGError()](facebook#11955) probably makes the seg fault more easily triggered.
This PR updates `ClearBGError()` to guarantee that db close cannot finish until the method is returned and the mutex is released. So that we can safely access DB state after calling it.
Pull Request resolved: facebook#13234
Test Plan: I could not trigger the seg fault locally, will just monitor future test failures.
Reviewed By: jowlyzhang
Differential Revision: D67476836
Pulled By: cbi42
fbshipit-source-id: dfb3e9ccd4eb3d43fc596ec10e4052861eeec002
Summary:
`DBErrorHandlingFSTest.AtomicFlushNoSpaceError` is flaky due to a seg fault during error recovery. I suspect that this is due to the DB being destructed and reopened during recovery. Specifically, the `ClearBGError()` call can release and re-acquire the mutex, and the DB can be closed during this time, so it's not safe to access DB state after `ClearBGError()`. There was a similar story in #9496. Moving the obsolete files logic after `ClearBGError()` probably makes the seg fault more easily triggered.
This PR updates `ClearBGError()` to guarantee that DB close cannot finish until the method returns and the mutex is released, so that we can safely access DB state after calling it.
Test plan: I could not trigger the seg fault locally, will just monitor future test failures.
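For readers who want to see the shape of the guarantee, here is a minimal toy sketch of the pattern described above: a recovery step that drops and re-acquires the mutex, and a close path that cannot finish while that step is in flight. This is not the RocksDB change itself. `ToyDB`, `Recover()`, `Close()`, `pending_recovery_` and the other names are hypothetical and exist only for illustration; the real fix lives inside `ClearBGError()` and may be structured differently.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Toy model only. ToyDB and all of its members are hypothetical names,
// not RocksDB APIs.
class ToyDB {
 public:
  // Models a recovery step that must drop the mutex part-way through
  // (for example, to run callbacks). Returns the value of closed_ that it
  // observes after re-acquiring the mutex.
  bool Recover() {
    std::unique_lock<std::mutex> lock(mu_);
    ++pending_recovery_;        // From here on, Close() cannot finish.
    recovery_started_ = true;
    cv_.notify_all();
    lock.unlock();              // Window in which Close() may start running.
    std::this_thread::yield();
    lock.lock();
    assert(pending_recovery_ > 0);
    --pending_recovery_;
    cv_.notify_all();
    // The mutex is still held here, so Close() cannot have observed
    // pending_recovery_ == 0 yet; reading DB state is safe.
    return closed_;
  }

  // Models DB close: waits until no recovery step is in flight.
  void Close() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return pending_recovery_ == 0; });
    closed_ = true;
  }

  // Demo helper: blocks until Recover() has entered its critical section.
  void WaitUntilRecoveryStarted() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return recovery_started_; });
  }

  bool IsClosed() {
    std::lock_guard<std::mutex> lock(mu_);
    return closed_;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int pending_recovery_ = 0;
  bool recovery_started_ = false;
  bool closed_ = false;
};

// Runs recovery and close concurrently. Returns true if recovery never saw
// a closed DB and the DB is closed once both threads have finished.
bool RunScenario() {
  ToyDB db;
  bool saw_closed = true;
  std::thread recover([&] { saw_closed = db.Recover(); });
  std::thread closer([&] {
    db.WaitUntilRecoveryStarted();
    db.Close();
  });
  recover.join();
  closer.join();
  return !saw_closed && db.IsClosed();
}
```

Calling `RunScenario()` from a `main` function exercises the race. Because `Close()` waits on the counter under the same mutex, it can only proceed once `Recover()` has both decremented the counter and released the mutex, which is the property the PR description asks of `ClearBGError()`.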