Deflake DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL#11016
Deflake DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL#11016hx235 wants to merge 2 commits intofacebook:mainfrom
Conversation
|
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
|
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
I think it is reproducible. Can you try this patch?
diff --git a/db/db_impl/db_impl_files.cc b/db/db_impl/db_impl_files.cc
index 9834ff4b2..a6d24dbdb 100644
--- a/db/db_impl/db_impl_files.cc
+++ b/db/db_impl/db_impl_files.cc
@@ -315,6 +315,8 @@ void DBImpl::FindObsoleteFiles(JobContext* job_context, bool force,
}
log_write_mutex_.Unlock();
mutex_.Unlock();
+ bg_cv_.SignalAll();
+ usleep(10000);
TEST_SYNC_POINT_CALLBACK("FindObsoleteFiles::PostMutexUnlock", nullptr);
log_write_mutex_.Lock();
while (!logs_.empty() && logs_.front().number < min_log_number) {
I see the following failure:
$ TEST_TMPDIR=/dev/shm ./db_wal_test --gtest_filter=DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL
Note: Google Test filter = DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBWALTest
[ RUN ] DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL
db/db_wal_test.cc:1662: Failure
Value of: sync_point_called
Actual: false
Expected: true
[ FAILED ] DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL (71 ms)
[----------] 1 test from DBWALTest (71 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (71 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL
1 FAILED TEST
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
|
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
pdillinger
left a comment
There was a problem hiding this comment.
Test still fails without original fix, so LGTM
(Not familiar enough with this area to know if the test can be simplified)
|
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
ajkr
left a comment
There was a problem hiding this comment.
LGTM too. This test could use higher level APIs (after all, it is a DB level test) to avoid so many sync points
Just to record the rough ideas. ASSERT_OK(static_cast_with_check<DBImpl>(db)->TEST_WaitForBackgroundWork()); after the Flush() (see #9617) may be able to replace both the dependencies. Small max_manifest_file_size and verifying manifest changed (via some API?) could replace the ProcessManifestWrites callback. wal_file_path also seems accessible via some get files API
Comparing |
@hx235 I hope this is tried or explicitly rejected, in which case I can try it |
|
@ajkr - Hi - sorry for some reason I forgot to follow up on this. Will try it this week. Any blocking issue that you'd like me to try it in a day or two (instead of by this week)? |
No, I was just reminded for unrelated reasons that our code is difficult to refactor, and I think excessive sync points is one reason (of many). |
Sounds good! Will give this a better try once I am back from my trip/PTO |
…singMissingWAL) (#11186) Summary: **Context/Summary**: Simplify `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` based on #11016 (review) and delete unused sync points. Pull Request resolved: #11186 Test Plan: - UT failed before fix in #10892 and passes after - Check UT not flaky when running with https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878 Reviewed By: ajkr Differential Revision: D43034723 Pulled By: hx235 fbshipit-source-id: f503774987b8f3718505f99e95080a7fad28ac66
Context/Summary:
Credit to @ajkr's #11016 (review),
flaky test https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878 is due to
Flush()called in the test returned earlier than obsoleted WAL being found in background flush and SyncWAL() was called (i.e, "sync_point_called" sets to true). Fix this by making checkingsync_point_called == trueafter obsoleted WAL is found andSyncWAL()is called. Also rename the "sync_point_called" to be something more specific.Also, fix a potential flakiness due to manually setting a log threshold to force new manifest creation. This is unreliable so I decided to use sync point to force new manifest creation.
Test plan:
make check