Skip to content

Deflake DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL#11016

Closed
hx235 wants to merge 2 commits intofacebook:mainfrom
hx235:deflake_test
Closed

Deflake DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL#11016
hx235 wants to merge 2 commits intofacebook:mainfrom
hx235:deflake_test

Conversation

@hx235
Copy link
Copy Markdown
Contributor

@hx235 hx235 commented Dec 4, 2022

Context/Summary:
Credit to @ajkr's #11016 (review),
flaky test https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878 is due to Flush() called in the test returned earlier than obsoleted WAL being found in background flush and SyncWAL() was called (i.e, "sync_point_called" sets to true). Fix this by making checking sync_point_called == true after obsoleted WAL is found and SyncWAL() is called. Also rename the "sync_point_called" to be something more specific.

Also, fix a potential flakiness due to manually setting a log threshold to force new manifest creation. This is unreliable so I decided to use sync point to force new manifest creation.

Test plan:
make check

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 requested a review from ajkr December 4, 2022 06:43
@facebook-github-bot
Copy link
Copy Markdown
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Copy Markdown
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is reproducible. Can you try this patch?

diff --git a/db/db_impl/db_impl_files.cc b/db/db_impl/db_impl_files.cc
index 9834ff4b2..a6d24dbdb 100644
--- a/db/db_impl/db_impl_files.cc
+++ b/db/db_impl/db_impl_files.cc
@@ -315,6 +315,8 @@ void DBImpl::FindObsoleteFiles(JobContext* job_context, bool force,
     }
     log_write_mutex_.Unlock();
     mutex_.Unlock();
+    bg_cv_.SignalAll();
+    usleep(10000);
     TEST_SYNC_POINT_CALLBACK("FindObsoleteFiles::PostMutexUnlock", nullptr);
     log_write_mutex_.Lock();
     while (!logs_.empty() && logs_.front().number < min_log_number) {

I see the following failure:

$ TEST_TMPDIR=/dev/shm ./db_wal_test --gtest_filter=DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL
Note: Google Test filter = DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBWALTest
[ RUN      ] DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL
db/db_wal_test.cc:1662: Failure
Value of: sync_point_called
  Actual: false
Expected: true
[  FAILED  ] DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL (71 ms)
[----------] 1 test from DBWALTest (71 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (71 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL

 1 FAILED TEST

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 changed the title Potentially deflake DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL Deflake DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL Dec 6, 2022
@hx235 hx235 requested review from ajkr and pdillinger December 6, 2022 20:51
Copy link
Copy Markdown
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test still fails without original fix, so LGTM

(Not familiar enough with this area to know if the test can be simplified)

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Copy Markdown
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too. This test could use higher level APIs (after all, it is a DB level test) to avoid so many sync points

Just to record the rough ideas. ASSERT_OK(static_cast_with_check<DBImpl>(db)->TEST_WaitForBackgroundWork()); after the Flush() (see #9617) may be able to replace both the dependencies. Small max_manifest_file_size and verifying manifest changed (via some API?) could replace the ProcessManifestWrites callback. wal_file_path also seems accessible via some get files API

@ajkr
Copy link
Copy Markdown
Contributor

ajkr commented Dec 6, 2022

verifying manifest changed (via some API?) could replace the ProcessManifestWrites callback

Comparing dbfull()->TEST_Current_Manifest_FileNo() before vs. after looks like a good API for it

@ajkr
Copy link
Copy Markdown
Contributor

ajkr commented Jan 22, 2023

This test could use higher level APIs (after all, it is a DB level test) to avoid so many sync points

@hx235 I hope this is tried or explicitly rejected, in which case I can try it

@hx235
Copy link
Copy Markdown
Contributor Author

hx235 commented Jan 23, 2023

@ajkr - Hi - sorry for some reason I forgot to follow up on this. Will try it this week. Any blocking issue that you'd like me to try it in a day or two (instead of by this week)?

@ajkr
Copy link
Copy Markdown
Contributor

ajkr commented Jan 23, 2023

Any blocking issue that you'd like me to try it in a day or two (instead of by this week)?

No, I was just reminded for unrelated reasons that our code is difficult to refactor, and I think excessive sync points is one reason (of many).

@hx235
Copy link
Copy Markdown
Contributor Author

hx235 commented Jan 30, 2023

Any blocking issue that you'd like me to try it in a day or two (instead of by this week)?

No, I was just reminded for unrelated reasons that our code is difficult to refactor, and I think excessive sync points is one reason (of many).

Sounds good! Will give this a better try once I am back from my trip/PTO

facebook-github-bot pushed a commit that referenced this pull request Feb 7, 2023
…singMissingWAL) (#11186)

Summary:
**Context/Summary**:
Simplify `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` based on #11016 (review) and delete unused sync points.

Pull Request resolved: #11186

Test Plan:
- UT failed before fix in #10892 and passes after
- Check UT not flaky when running with https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878

Reviewed By: ajkr

Differential Revision: D43034723

Pulled By: hx235

fbshipit-source-id: f503774987b8f3718505f99e95080a7fad28ac66
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants