
mds: emit warning with estimated replay time #52527

Closed

manishym wants to merge 1 commit into ceph:main from manishym:log_estimated_journal_replay_time

Conversation

@manishym (Collaborator) commented Jul 19, 2023

MDS might take time replaying the journal; it is helpful to get an estimate of how much time it might take to finish replaying the journal.
Fixes: https://tracker.ceph.com/issues/61863
Signed-off-by: Manish M Yathnalli myathnal@redhat.com
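For context, the arithmetic behind such an estimate can be sketched from the replay cursor's progress rate. This is a hypothetical Python illustration; the function name and the byte-offset model are assumptions for exposition, not the PR's actual MDLog C++ code:

```python
def estimate_replay_time(start_pos, cur_pos, end_pos, elapsed_s):
    """Estimate seconds left to finish replay, given how far the replay
    cursor has advanced (journal byte offsets) in elapsed_s seconds."""
    replayed = cur_pos - start_pos
    if replayed <= 0 or elapsed_s <= 0:
        return None  # no progress yet, so no meaningful estimate
    rate = replayed / elapsed_s      # bytes replayed per second so far
    return (end_pos - cur_pos) / rate

# 4 GiB journal, 1 GiB replayed in 30s -> roughly 90s remaining
print(estimate_replay_time(0, 1 << 30, 4 << 30, 30.0))  # -> 90.0
```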

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@manishym manishym requested a review from vshankar July 19, 2023 05:26
@github-actions github-actions bot added the cephfs Ceph File System label Jul 19, 2023
@manishym manishym requested a review from batrick July 19, 2023 05:27
@vshankar vshankar requested a review from a team July 19, 2023 05:30
@batrick (Member) left a comment


Don't forget a PendingReleaseNotes entry. Add the new beacon health code to doc/cephfs/health-messages.rst.

Add a test case which checks for this health warning. I think qa/tasks/cephfs/test_failover.py is a suitable place for a test. Grep for self.wait_for_health to see how to check for the warning. You will want to add a dev config to artificially slow down replay (e.g. adding a sleep() to MDLog::_replay_thread) so you can see the health warning.

Edit: sync with @rishabh-d-dave on how to get vstart_runner.py working for writing your test case.
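The wait_for_health/wait_until_true pattern referenced above boils down to a timeout-bounded polling loop. A self-contained sketch of that pattern, with a simplified signature (not the actual qa/tasks/ceph_test_case.py code):

```python
import time

def wait_until_true(condition, timeout, period=5):
    """Poll `condition` every `period` seconds until it returns True;
    raise TimeoutError once `timeout` seconds of polling have elapsed."""
    elapsed = 0
    while not condition():
        if elapsed >= timeout:
            raise TimeoutError(f"Timed out after {elapsed}s")
        time.sleep(period)
        elapsed += period
    return elapsed
```

A test would pass a closure such as `lambda: "MDS_ESTIMATED_REPLAY_TIME" in health_report()` as the condition.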

@batrick (Member) commented Jul 19, 2023

Also, please don't forget to check the boxes in the PR description for your changes. Finally, mark the ticket as "Fix under review" and put the PR number in "Pull Request ID" field.


@github-actions github-actions bot added the tests label Aug 1, 2023
@manishym (Collaborator, Author) commented Aug 1, 2023

Log:

```
→ until grep -ir "MDS_ESTIMATED_REPLAY_TIME" ./out/; do echo "Retrying" ; done
./out/mon.b.log:2023-08-01T06:19:08.492+0530 7fdc57800700 7 mon.b@1(peon).log v78 update_from_paxos applying incremental log 78 2023-08-01T06:19:07.672661+0530 mon.a (mon.0) 257 : cluster [WRN] Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/mon.b.log:2023-08-01T06:19:10.564+0530 7fdc57800700 7 mon.b@1(peon).log v80 update_from_paxos applying incremental log 80 2023-08-01T06:19:09.696497+0530 mon.a (mon.0) 268 : cluster [INF] Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
./out/cluster.mon.c.log:2023-08-01T06:19:07.672661+0530 mon.a (mon.0) 257 : cluster 3 Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/cluster.mon.c.log:2023-08-01T06:19:09.696497+0530 mon.a (mon.0) 268 : cluster 1 Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
./out/mon.c.log:2023-08-01T06:19:08.496+0530 7f4bb98d6700 7 mon.c@2(peon).log v78 update_from_paxos applying incremental log 78 2023-08-01T06:19:07.672661+0530 mon.a (mon.0) 257 : cluster [WRN] Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/mon.c.log:2023-08-01T06:19:10.568+0530 7f4bb98d6700 7 mon.c@2(peon).log v80 update_from_paxos applying incremental log 80 2023-08-01T06:19:09.696497+0530 mon.a (mon.0) 268 : cluster [INF] Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
./out/cluster.mon.b.log:2023-08-01T06:19:07.672661+0530 mon.a (mon.0) 257 : cluster 3 Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/cluster.mon.b.log:2023-08-01T06:19:09.696497+0530 mon.a (mon.0) 268 : cluster 1 Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
./out/cluster.mon.a.log:2023-08-01T06:19:07.672661+0530 mon.a (mon.0) 257 : cluster 3 Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/cluster.mon.a.log:2023-08-01T06:19:09.696497+0530 mon.a (mon.0) 268 : cluster 1 Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
./out/mon.a.log:2023-08-01T06:19:07.668+0530 7fee5bb64700 0 log_channel(cluster) log [WRN] : Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/mon.a.log:2023-08-01T06:19:07.672+0530 7fee5935f700 10 mon.a@0(leader).log v77 logging 2023-08-01T06:19:07.672661+0530 mon.a (mon.0) 257 : cluster [WRN] Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/mon.a.log:2023-08-01T06:19:08.484+0530 7fee56b5a700 7 mon.a@0(leader).log v78 update_from_paxos applying incremental log 78 2023-08-01T06:19:07.672661+0530 mon.a (mon.0) 257 : cluster [WRN] Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/mon.a.log:2023-08-01T06:19:09.692+0530 7fee5bb64700 0 log_channel(cluster) log [INF] : Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
./out/mon.a.log:2023-08-01T06:19:09.696+0530 7fee5935f700 10 mon.a@0(leader).log v79 logging 2023-08-01T06:19:09.696497+0530 mon.a (mon.0) 268 : cluster [INF] Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
./out/mon.a.log:2023-08-01T06:19:10.560+0530 7fee56b5a700 7 mon.a@0(leader).log v80 update_from_paxos applying incremental log 80 2023-08-01T06:19:09.696497+0530 mon.a (mon.0) 268 : cluster [INF] Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
```

Error in unit test:

```
2023-08-01 06:19:43,434.434 INFO:__main__:
2023-08-01 06:19:43,434.434 INFO:__main__:----------------------------------------------------------------------
2023-08-01 06:19:43,435.435 INFO:__main__:Ran 1 test in 72.599s
2023-08-01 06:19:43,435.435 INFO:__main__:
2023-08-01 06:19:43,435.435 INFO:__main__:FAILED (errors=1)
2023-08-01 06:19:43,435.435 INFO:__main__:
2023-08-01 06:19:43,435.435 INFO:__main__:
2023-08-01 06:19:43,435.435 INFO:__main__:======================================================================
2023-08-01 06:19:43,436.436 INFO:__main__:ERROR: test_replay_beacon_estimated_time (tasks.cephfs.test_failover.TestFailoverBeaconHealth)
2023-08-01 06:19:43,436.436 INFO:__main__:That beacon emits warning message with estimated time to complete replay
2023-08-01 06:19:43,436.436 INFO:__main__:----------------------------------------------------------------------
2023-08-01 06:19:43,436.436 INFO:__main__:Traceback (most recent call last):
2023-08-01 06:19:43,436.436 INFO:__main__:  File "/home/manish/work/ibm/ceph/ceph/qa/tasks/cephfs/test_failover.py", line 328, in test_replay_beacon_estimated_time
2023-08-01 06:19:43,436.436 INFO:__main__:    self.wait_for_health("MDS_ESTIMATED_REPLAY_TIME", 20)
2023-08-01 06:19:43,437.437 INFO:__main__:  File "/home/manish/work/ibm/ceph/ceph/qa/tasks/ceph_test_case.py", line 184, in wait_for_health
2023-08-01 06:19:43,437.437 INFO:__main__:    self.wait_until_true(seen_health_warning, timeout)
2023-08-01 06:19:43,437.437 INFO:__main__:  File "/home/manish/work/ibm/ceph/ceph/qa/tasks/ceph_test_case.py", line 231, in wait_until_true
2023-08-01 06:19:43,437.437 INFO:__main__:    raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2023-08-01 06:19:43,437.437 INFO:__main__:tasks.ceph_test_case.TestTimeoutError: Timed out after 20s and 0 retries
2023-08-01 06:19:43,437.437 INFO:__main__:
Using guessed paths /home/manish/work/ibm/ceph/ceph/main_build/lib/ ['/home/manish/work/ibm/ceph/ceph/qa', '/home/manish/work/ibm/ceph/ceph/main_build/lib/cython_modules/lib.3', '/home/manish/work/ibm/ceph/ceph/src/pybind']
```


github-actions bot commented Aug 2, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@vshankar (Contributor) commented Aug 3, 2023

@manishym On this today

@vshankar (Contributor) left a comment


@manishym Regarding the test failure - it's possible that the health warning showed up but wait_for_health missed it, since it uses wait_until_true, which has a (default) sleep period of 5s. You can do a couple of things here:

  • dump the health metrics in wait_for_health on each iteration to verify that the warning does show up
  • (preferably) increase the data set size to increase the time spent by the MDS in up:replay, so that the health warning sticks around longer and is not missed by the health check helper.

Also, I suggest including an additional check in the test to verify that the health warning gets cleared when the MDS transitions to up:active.
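To see why a longer up:replay phase helps, the sampling race can be shown numerically. This is a toy simulation with scaled-down, hypothetical timings, not Ceph code:

```python
def poll_times(period, timeout):
    """Instants (in seconds) at which a poll-based health check samples."""
    return range(0, timeout + 1, period)

def warning_seen(active_from, active_until, period, timeout):
    """True if any poll sample lands inside the warning's active window."""
    return any(active_from <= t < active_until
               for t in poll_times(period, timeout))

# A warning active only during [6s, 8s) is invisible to a 5s poller,
# which samples at t = 0, 5, 10, 15, 20...
print(warning_seen(6, 8, 5, 20))    # -> False
# ...but a longer replay keeps it active during [6s, 14s), so the
# sample at t = 10 catches it.
print(warning_seen(6, 14, 5, 20))   # -> True
```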

@vshankar (Contributor) commented
@manishym ping?

@manishym manishym force-pushed the log_estimated_journal_replay_time branch from c2b76e2 to 6e197ae Compare August 14, 2023 12:02
@manishym manishym requested review from batrick and vshankar August 14, 2023 12:21
@vshankar (Contributor) commented
@manishym I guess this is waiting for an update to be pushed since many comments are still pending to be resolved, yes?

@manishym manishym requested a review from a team as a code owner August 28, 2023 10:53
@manishym (Collaborator, Author) commented
@vshankar I have addressed all the comments.

@vshankar (Contributor) commented

> @vshankar I have addressed all the comments.

Will have a look tomorrow.

@manishym manishym force-pushed the log_estimated_journal_replay_time branch 2 times, most recently from f663b9a to 82af065 Compare August 29, 2023 09:49
@manishym manishym force-pushed the log_estimated_journal_replay_time branch from 82af065 to 51839cf Compare August 29, 2023 12:25
@vshankar (Contributor) left a comment


Otherwise LGTM.

@manishym manishym force-pushed the log_estimated_journal_replay_time branch from 51839cf to 11d8455 Compare September 5, 2023 09:58
@vshankar vshankar changed the title MDLog: Estimated time to complete replay mds: emit warning with estimated replay time Sep 6, 2023
@manishym manishym force-pushed the log_estimated_journal_replay_time branch from a1da91d to 4f7232b Compare November 6, 2023 11:43
@batrick (Member) left a comment


Almost there!

@manishym manishym force-pushed the log_estimated_journal_replay_time branch from 4f7232b to 7a0595a Compare November 7, 2023 10:17
@manishym manishym requested a review from batrick November 7, 2023 10:17
@batrick (Member) left a comment


small simplification:

```
MDS might take time replaying the journal. It is helpful to get an
estimate of how much time it might take to finish replaying the journal.

Fixes: https://tracker.ceph.com/issues/61863
Signed-off-by: Manish M Yathnalli <myathnal@redhat.com>
```
@manishym manishym force-pushed the log_estimated_journal_replay_time branch from 7a0595a to 2971358 Compare November 7, 2023 17:18
@manishym manishym requested review from batrick and kotreshhr November 7, 2023 17:18
@vshankar (Contributor) commented

jenkins test make check

@vshankar (Contributor) commented

Test runs in ~2h - wip-vshankar-testing-20231127.102654

@vshankar (Contributor) commented

jenkins test make check

@vshankar (Contributor) commented Dec 4, 2023

https://pulpito.ceph.com/?branch=wip-vshankar-testing-20231127.102654

(there are a bunch of rhel package install failures, so this will need a revalidation run)

@vshankar (Contributor) commented

jenkins test make check

```python
self.fs.fail()
self.fs.set_joinable()

def test_replay_beacon_estimated_time(self):
```
Contributor left a comment

There is a (probably unrelated) test failure here: https://pulpito.ceph.com/vshankar-2024-01-10_15:00:23-fs-wip-vshankar-testing-20240103.072409-1-testing-default-smithi/7511461/

It hints at a kernel umount hang. I've requested @lxbsz to have a look. However, the subsequent tests didn't run since the run aborted on the failed test, so I'll rerun the specific tests once @lxbsz updates. The rest of the fs suite tests look good.

Contributor left a comment

Looks like the umount hang is due to the MDS still being in replay, since the test case modified a config to slow down replay. The config needs to be reset, and the test should wait for the MDS to be back to active before ending (and the client unmounting).

@github-actions (bot) commented

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@vshankar (Contributor) commented Feb 8, 2024

@manishym I'll fix up the failures and push a change. Since I cannot push an update to a PR owned by you, I'll push a new PR with this change and the fixes on top. I'll preserve the contribution tags of course :)

@vshankar (Contributor) commented

Superseded by #55616.

@vshankar vshankar closed this Feb 16, 2024


6 participants