mds: emit warning with estimated replay time#52527
Don't forget a PendingReleaseNote. Add the new beacon health code to doc/cephfs/health-messages.rst.
Add a test case which checks for this health warning. I think qa/tasks/cephfs/test_failover.py is a suitable place for a test. Grep for self.wait_for_health to see how to check for the warning. You will want to add a dev config to artificially slow down replay (adding a sleep() to MDLog::_replay_thread) so you can see the health warning.
Edit: sync with @rishabh-d-dave on how to get vstart_runner.py working for writing your test case.
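A minimal sketch of the kind of check being requested, written against a fake health source rather than a real cluster. The helper name `poll_for_health_code`, the code `EXAMPLE_REPLAY_WARNING`, and the shape of the health report (a `checks` dictionary keyed by health code) are assumptions for illustration; the actual test would use the existing `self.wait_for_health` helper and the health code added by this PR.

```python
import time

# Placeholder health code; the real code is whatever this PR adds.
WARNING_CODE = "EXAMPLE_REPLAY_WARNING"


def health_has(health, code):
    """Return True if the (assumed) health report lists the given check."""
    return code in health.get("checks", {})


def poll_for_health_code(get_health, code, timeout, period=5, sleep=time.sleep):
    """Poll get_health() every `period` seconds until `code` shows up.

    Returns the number of seconds waited, or raises TimeoutError.
    """
    elapsed = 0
    while True:
        if health_has(get_health(), code):
            return elapsed
        if elapsed >= timeout:
            raise TimeoutError(f"{code} not seen within {timeout}s")
        sleep(period)
        elapsed += period


# Fake health reports: the warning appears on the third poll.
reports = iter([
    {"checks": {}},
    {"checks": {}},
    {"checks": {WARNING_CODE: {"severity": "HEALTH_WARN"}}},
])

# A no-op sleep keeps the example instant; a real test would really sleep.
seen_after = poll_for_health_code(
    lambda: next(reports), WARNING_CODE, timeout=30, sleep=lambda s: None
)
```

Injecting `sleep` and `get_health` keeps the polling logic testable without a running MDS.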
Also, please don't forget to check the boxes in the PR description for your changes. Finally, mark the ticket as "Fix under review" and put the PR number in the "Pull Request ID" field.
Log:
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
@manishym On this today
vshankar left a comment
@manishym Regarding the test failure: it's possible that the health warning showed up but wait_for_health missed it, since it uses wait_until_true, which has a (default) sleep period of 5s. You can do a couple of things here:
- dump the health metric in wait_for_health every time to verify that the warning does show up
- (preferably) increase the data set size to increase the time spent by the MDS in up:replay, so that the health warning sticks around for longer and is not missed by the health check helper.
Also, I suggest including an additional check in the test to verify that the health warning gets cleared when the MDS transitions to up:active.
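The timing argument above can be illustrated with a small simulation. This is only a sketch: the function names and the second-granularity model are assumptions, and the 5s period is taken from the comment rather than from the helper's source.

```python
def polls_that_see_warning(warn_start, warn_end, period, timeout):
    """Return the poll times (in seconds) at which the warning is visible.

    The warning is modelled as present during [warn_start, warn_end).
    Polls happen at 0, period, 2 * period, ... up to and including timeout.
    """
    return [
        t for t in range(0, timeout + 1, period)
        if warn_start <= t < warn_end
    ]


def warning_cleared(health, code):
    """Check that `code` is absent from an (assumed) health report dict."""
    return code not in health.get("checks", {})


# A 3-second warning window falls between two 5-second polls and is missed.
short_replay = polls_that_see_warning(warn_start=6, warn_end=9, period=5, timeout=30)

# A longer replay (larger data set) keeps the warning up across several polls.
long_replay = polls_that_see_warning(warn_start=6, warn_end=26, period=5, timeout=30)
```

With these numbers the short window is never observed, while the long one is seen on four consecutive polls, which is why growing the data set is the more robust fix. `warning_cleared` sketches the extra assertion suggested for the point where the MDS reaches up:active.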
@manishym ping?
c2b76e2 to 6e197ae
@manishym I guess this is waiting for an update to be pushed since many comments are still pending to be resolved, yes?
@vshankar I have addressed all the comments.
Will have a look tomorrow.
f663b9a to 82af065
82af065 to 51839cf
51839cf to 11d8455
a1da91d to 4f7232b
4f7232b to 7a0595a
The MDS might take time replaying the journal. It is helpful to get an estimate of how much time it might take to finish replaying the journal.
Fixes: https://tracker.ceph.com/issues/61863
Signed-off-by: Manish M Yathnalli <myathnal@redhat.com>
7a0595a to 2971358
jenkins test make check
Test runs in ~2h - wip-vshankar-testing-20231127.102654
jenkins test make check
https://pulpito.ceph.com/?branch=wip-vshankar-testing-20231127.102654 (there are a bunch of rhel pkg install failures, so this would need a revalidate)
jenkins test make check
    self.fs.fail()
    self.fs.set_joinable()

    def test_replay_beacon_estimated_time(self):
There is a (probably unrelated) test failure here: https://pulpito.ceph.com/vshankar-2024-01-10_15:00:23-fs-wip-vshankar-testing-20240103.072409-1-testing-default-smithi/7511461/
Which hints at a kernel umount hang. I've requested @lxbsz to have a look. However, the subsequent tests didn't run since the test case aborted on the failed test, so I'll rerun the specific test once @lxbsz updates. The rest of the fs suite tests look good.
Looks like the umount hang is due to the MDS still being in replay, since the test case modified a config to slow down replay. The config needs to be reset and the test should wait for the MDS to be back active before ending (client unmounting).
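One way to structure that cleanup is a try/finally around the test body, sketched here against a fake cluster object. `FakeCluster`, its methods, and the config name `example_replay_delay` are all hypothetical stand-ins, not the qa framework's API or the dev config added by this PR.

```python
DELAY_OPTION = "example_replay_delay"  # hypothetical dev config name


class FakeCluster:
    """Tiny stand-in for a test cluster, just enough to show the pattern."""

    def __init__(self):
        self.config = {}
        self.mds_state = "up:active"

    def set_config(self, key, value):
        self.config[key] = value

    def remove_config(self, key):
        self.config.pop(key, None)

    def restart_mds(self):
        self.mds_state = "up:replay"

    def wait_for_active(self):
        # In this fake, replay only completes once the delay is removed.
        if DELAY_OPTION in self.config:
            raise RuntimeError("replay is still being slowed down")
        self.mds_state = "up:active"


def run_slow_replay_test(cluster, body):
    """Run `body` with replay slowed down, then always restore the cluster."""
    cluster.set_config(DELAY_OPTION, 5)
    try:
        cluster.restart_mds()
        body(cluster)
    finally:
        # Reset the config first, then wait for up:active, so that a later
        # client unmount does not block on an MDS that is stuck in replay.
        cluster.remove_config(DELAY_OPTION)
        cluster.wait_for_active()


cluster = FakeCluster()
observed = []
run_slow_replay_test(cluster, lambda c: observed.append(c.mds_state))
```

Because the cleanup lives in `finally`, it also runs when an assertion in the test body fails, which is the case that left the MDS in replay here.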
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
@manishym I'll fix up the failures and push a change. Since I cannot push an update to a PR owned by you, I'll push a new PR with this change and fixes on top. I'll preserve the contribution tags of course :)
The MDS might take time replaying the journal. It is helpful to get an estimate of how much time it might take to finish replaying the journal.
Fixes: https://tracker.ceph.com/issues/61863
Signed-off-by: Manish M Yathnalli myathnal@redhat.com
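For readers unfamiliar with the idea, a replay-time estimate can be sketched as "remaining journal divided by observed replay rate". The function below is purely illustrative: its name, parameters, and the linear-rate assumption are not taken from this PR, whose actual computation lives in the MDS code.

```python
def estimate_replay_seconds(start_pos, read_pos, write_pos, elapsed_seconds):
    """Estimate the seconds left to replay a journal (illustrative only).

    start_pos:       journal offset where replay began
    read_pos:        offset replayed so far
    write_pos:       offset of the end of the journal
    elapsed_seconds: time spent replaying so far

    Returns None when no progress has been made yet, since no rate is known.
    """
    replayed = read_pos - start_pos
    remaining = write_pos - read_pos
    if replayed <= 0 or elapsed_seconds <= 0:
        return None
    rate = replayed / elapsed_seconds  # bytes per second, assumed constant
    return remaining / rate


# 400 bytes replayed in 20s gives 20 bytes/s; 600 bytes remain.
estimate = estimate_replay_seconds(
    start_pos=0, read_pos=400, write_pos=1000, elapsed_seconds=20
)
```

A constant rate is a simplification; real journal events vary in cost, so any such figure is a rough guide rather than a guarantee.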