
mds: emit warning with estimated replay time #52527

Closed

manishym wants to merge 1 commit into ceph:main from manishym:log_estimated_journal_replay_time

Conversation

@manishym (Collaborator) commented Jul 19, 2023

MDS might take time replaying the journal; it is helpful to get an estimate of how much time it might take to finish replaying the journal.
Fixes: https://tracker.ceph.com/issues/61863
Signed-off-by: Manish M Yathnalli myathnal@redhat.com
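For context, the arithmetic behind such an estimate can be sketched from the replay cursor's progress rate. This is a hypothetical Python illustration; the function name and the byte-offset model are assumptions for exposition, not the PR's actual MDLog C++ code:

```python
def estimate_replay_time(start_pos, cur_pos, end_pos, elapsed_s):
    """Estimate seconds left to finish replay, given how far the replay
    cursor has advanced (journal byte offsets) in elapsed_s seconds."""
    replayed = cur_pos - start_pos
    if replayed <= 0 or elapsed_s <= 0:
        return None  # no progress yet, so no meaningful estimate
    rate = replayed / elapsed_s      # bytes replayed per second so far
    return (end_pos - cur_pos) / rate

# 4 GiB journal, 1 GiB replayed in 30s -> roughly 90s remaining
print(estimate_replay_time(0, 1 << 30, 4 << 30, 30.0))  # -> 90.0
```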

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@manishym manishym requested a review from vshankar July 19, 2023 05:26
@github-actions github-actions bot added the cephfs Ceph File System label Jul 19, 2023
@manishym manishym requested a review from batrick July 19, 2023 05:27
@vshankar vshankar requested a review from a team July 19, 2023 05:30
@batrick (Member) left a comment


Don't forget a PendingReleaseNotes entry. Add the new beacon health code to doc/cephfs/health-messages.rst.

Add a test case which checks for this health warning. I think qa/tasks/cephfs/test_failover.py is a suitable place for a test. Grep for self.wait_for_health to see how to check for the warning. You will want to add a dev config to artificially slow down replay (e.g. adding a sleep() to MDLog::_replay_thread) so you can see the health warning.

Edit: sync with @rishabh-d-dave on how to get vstart_runner.py working for writing your test case.
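The wait_for_health/wait_until_true pattern referenced above boils down to a timeout-bounded polling loop. A self-contained sketch of that pattern, with a simplified signature (not the actual qa/tasks/ceph_test_case.py code):

```python
import time

def wait_until_true(condition, timeout, period=5):
    """Poll `condition` every `period` seconds until it returns True;
    raise TimeoutError once `timeout` seconds of polling have elapsed."""
    elapsed = 0
    while not condition():
        if elapsed >= timeout:
            raise TimeoutError(f"Timed out after {elapsed}s")
        time.sleep(period)
        elapsed += period
    return elapsed
```

A test would pass a closure such as `lambda: "MDS_ESTIMATED_REPLAY_TIME" in health_report()` as the condition.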

@batrick (Member) commented Jul 19, 2023

Also, please don't forget to check the boxes in the PR description for your changes. Finally, mark the ticket as "Fix under review" and put the PR number in "Pull Request ID" field.


@github-actions github-actions bot added the tests label Aug 1, 2023
@manishym (Collaborator, Author) commented Aug 1, 2023

Log:

```
→ until grep -ir "MDS_ESTIMATED_REPLAY_TIME" ./out/; do echo "Retrying" ; done
./out/mon.b.log:2023-08-01T06:19:08.492+0530 7fdc57800700 7 mon.b@1(peon).log v78 update_from_paxos applying incremental log 78 2023-08-01T06:19:07.672661+0530 mon.a (mon.0) 257 : cluster [WRN] Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/mon.b.log:2023-08-01T06:19:10.564+0530 7fdc57800700 7 mon.b@1(peon).log v80 update_from_paxos applying incremental log 80 2023-08-01T06:19:09.696497+0530 mon.a (mon.0) 268 : cluster [INF] Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
./out/cluster.mon.c.log:2023-08-01T06:19:07.672661+0530 mon.a (mon.0) 257 : cluster 3 Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/cluster.mon.c.log:2023-08-01T06:19:09.696497+0530 mon.a (mon.0) 268 : cluster 1 Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
./out/mon.c.log:2023-08-01T06:19:08.496+0530 7f4bb98d6700 7 mon.c@2(peon).log v78 update_from_paxos applying incremental log 78 2023-08-01T06:19:07.672661+0530 mon.a (mon.0) 257 : cluster [WRN] Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/mon.c.log:2023-08-01T06:19:10.568+0530 7f4bb98d6700 7 mon.c@2(peon).log v80 update_from_paxos applying incremental log 80 2023-08-01T06:19:09.696497+0530 mon.a (mon.0) 268 : cluster [INF] Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
./out/cluster.mon.b.log:2023-08-01T06:19:07.672661+0530 mon.a (mon.0) 257 : cluster 3 Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/cluster.mon.b.log:2023-08-01T06:19:09.696497+0530 mon.a (mon.0) 268 : cluster 1 Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
./out/cluster.mon.a.log:2023-08-01T06:19:07.672661+0530 mon.a (mon.0) 257 : cluster 3 Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/cluster.mon.a.log:2023-08-01T06:19:09.696497+0530 mon.a (mon.0) 268 : cluster 1 Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
./out/mon.a.log:2023-08-01T06:19:07.668+0530 7fee5bb64700 0 log_channel(cluster) log [WRN] : Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/mon.a.log:2023-08-01T06:19:07.672+0530 7fee5935f700 10 mon.a@0(leader).log v77 logging 2023-08-01T06:19:07.672661+0530 mon.a (mon.0) 257 : cluster [WRN] Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/mon.a.log:2023-08-01T06:19:08.484+0530 7fee56b5a700 7 mon.a@0(leader).log v78 update_from_paxos applying incremental log 78 2023-08-01T06:19:07.672661+0530 mon.a (mon.0) 257 : cluster [WRN] Health check failed: 1 estimated journal reply time (MDS_ESTIMATED_REPLAY_TIME)
./out/mon.a.log:2023-08-01T06:19:09.692+0530 7fee5bb64700 0 log_channel(cluster) log [INF] : Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
./out/mon.a.log:2023-08-01T06:19:09.696+0530 7fee5935f700 10 mon.a@0(leader).log v79 logging 2023-08-01T06:19:09.696497+0530 mon.a (mon.0) 268 : cluster [INF] Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
./out/mon.a.log:2023-08-01T06:19:10.560+0530 7fee56b5a700 7 mon.a@0(leader).log v80 update_from_paxos applying incremental log 80 2023-08-01T06:19:09.696497+0530 mon.a (mon.0) 268 : cluster [INF] Health check cleared: MDS_ESTIMATED_REPLAY_TIME (was: 1 estimated journal reply time)
```

Error in unit test:

```
2023-08-01 06:19:43,434.434 INFO:__main__:
2023-08-01 06:19:43,434.434 INFO:__main__:----------------------------------------------------------------------
2023-08-01 06:19:43,435.435 INFO:__main__:Ran 1 test in 72.599s
2023-08-01 06:19:43,435.435 INFO:__main__:
2023-08-01 06:19:43,435.435 INFO:__main__:FAILED (errors=1)
2023-08-01 06:19:43,435.435 INFO:__main__:
2023-08-01 06:19:43,435.435 INFO:__main__:
2023-08-01 06:19:43,435.435 INFO:__main__:======================================================================
2023-08-01 06:19:43,436.436 INFO:__main__:ERROR: test_replay_beacon_estimated_time (tasks.cephfs.test_failover.TestFailoverBeaconHealth)
2023-08-01 06:19:43,436.436 INFO:__main__:That beacon emits warning message with estimated time to complete replay
2023-08-01 06:19:43,436.436 INFO:__main__:----------------------------------------------------------------------
2023-08-01 06:19:43,436.436 INFO:__main__:Traceback (most recent call last):
2023-08-01 06:19:43,436.436 INFO:__main__:  File "/home/manish/work/ibm/ceph/ceph/qa/tasks/cephfs/test_failover.py", line 328, in test_replay_beacon_estimated_time
2023-08-01 06:19:43,436.436 INFO:__main__:    self.wait_for_health("MDS_ESTIMATED_REPLAY_TIME", 20)
2023-08-01 06:19:43,437.437 INFO:__main__:  File "/home/manish/work/ibm/ceph/ceph/qa/tasks/ceph_test_case.py", line 184, in wait_for_health
2023-08-01 06:19:43,437.437 INFO:__main__:    self.wait_until_true(seen_health_warning, timeout)
2023-08-01 06:19:43,437.437 INFO:__main__:  File "/home/manish/work/ibm/ceph/ceph/qa/tasks/ceph_test_case.py", line 231, in wait_until_true
2023-08-01 06:19:43,437.437 INFO:__main__:    raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2023-08-01 06:19:43,437.437 INFO:__main__:tasks.ceph_test_case.TestTimeoutError: Timed out after 20s and 0 retries
2023-08-01 06:19:43,437.437 INFO:__main__:
Using guessed paths /home/manish/work/ibm/ceph/ceph/main_build/lib/ ['/home/manish/work/ibm/ceph/ceph/qa', '/home/manish/work/ibm/ceph/ceph/main_build/lib/cython_modules/lib.3', '/home/manish/work/ibm/ceph/ceph/src/pybind']
```


github-actions bot commented Aug 2, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@vshankar (Contributor) commented Aug 3, 2023

@manishym On this today

@vshankar (Contributor) left a comment


@manishym Regarding the test failure - it's possible that the health warning showed up but wait_for_health missed it, since it uses wait_until_true, which has a (default) sleep period of 5s. You can do a couple of things here:

  • dump the health metrics in wait_for_health on each iteration to verify that the warning does show up
  • (preferably) increase the data set size to increase the time spent by the MDS in up:replay, so that the health warning sticks around longer and is not missed by the health check helper.

Also, I suggest including an additional check in the test to verify that the health warning gets cleared when the MDS transitions to up:active.
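To see why a longer up:replay phase helps, the sampling race can be shown numerically. This is a toy simulation with scaled-down, hypothetical timings, not Ceph code:

```python
def poll_times(period, timeout):
    """Instants (in seconds) at which a poll-based health check samples."""
    return range(0, timeout + 1, period)

def warning_seen(active_from, active_until, period, timeout):
    """True if any poll sample lands inside the warning's active window."""
    return any(active_from <= t < active_until
               for t in poll_times(period, timeout))

# A warning active only during [6s, 8s) is invisible to a 5s poller,
# which samples at t = 0, 5, 10, 15, 20...
print(warning_seen(6, 8, 5, 20))    # -> False
# ...but a longer replay keeps it active during [6s, 14s), so the
# sample at t = 10 catches it.
print(warning_seen(6, 14, 5, 20))   # -> True
```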

@vshankar (Contributor) commented
@manishym ping?

@manishym manishym force-pushed the log_estimated_journal_replay_time branch from c2b76e2 to 6e197ae Compare August 14, 2023 12:02
@manishym manishym requested review from batrick and vshankar August 14, 2023 12:21
@vshankar (Contributor) commented
@manishym I guess this is waiting for an update to be pushed since many comments are still pending to be resolved, yes?

@manishym manishym requested a review from a team as a code owner August 28, 2023 10:53
@manishym (Collaborator, Author) commented
@vshankar I have addressed all the comments.

@vshankar (Contributor) commented

> @vshankar I have addressed all the comments.

Will have a look tomorrow.

@manishym manishym force-pushed the log_estimated_journal_replay_time branch 2 times, most recently from f663b9a to 82af065 Compare August 29, 2023 09:49
@manishym manishym force-pushed the log_estimated_journal_replay_time branch from 82af065 to 51839cf Compare August 29, 2023 12:25
@vshankar (Contributor) left a comment


Otherwise LGTM.

@manishym manishym force-pushed the log_estimated_journal_replay_time branch from 51839cf to 11d8455 Compare September 5, 2023 09:58
@vshankar vshankar changed the title MDLog: Estimated time to complete replay mds: emit warning with estimated replay time Sep 6, 2023
@manishym manishym force-pushed the log_estimated_journal_replay_time branch from a1da91d to 4f7232b Compare November 6, 2023 11:43
@batrick (Member) left a comment


Almost there!

@manishym manishym force-pushed the log_estimated_journal_replay_time branch from 4f7232b to 7a0595a Compare November 7, 2023 10:17
@manishym manishym requested a review from batrick November 7, 2023 10:17
@batrick (Member) left a comment


small simplification:

```
MDS might take time replaying the journal. It is helpful to get an
estimate of how much time it might take to finish replaying the journal.

Fixes: https://tracker.ceph.com/issues/61863
Signed-off-by: Manish M Yathnalli <myathnal@redhat.com>
```
@manishym manishym force-pushed the log_estimated_journal_replay_time branch from 7a0595a to 2971358 Compare November 7, 2023 17:18
@manishym manishym requested review from batrick and kotreshhr November 7, 2023 17:18
@vshankar (Contributor) commented

jenkins test make check

@vshankar (Contributor) commented

Test runs in ~2h - wip-vshankar-testing-20231127.102654

@vshankar (Contributor) commented

jenkins test make check

@vshankar (Contributor) commented Dec 4, 2023

https://pulpito.ceph.com/?branch=wip-vshankar-testing-20231127.102654

(there are a bunch of rhel package install failures, so this will need a revalidation run)

@vshankar (Contributor) commented

jenkins test make check

```python
self.fs.fail()
self.fs.set_joinable()

def test_replay_beacon_estimated_time(self):
```
Contributor left a comment

There is a (probably unrelated) test failure here: https://pulpito.ceph.com/vshankar-2024-01-10_15:00:23-fs-wip-vshankar-testing-20240103.072409-1-testing-default-smithi/7511461/

It hints at a kernel umount hang. I've requested @lxbsz to have a look. However, the subsequent tests didn't run since the run aborted on the failed test, so I'll rerun the specific tests once @lxbsz updates. The rest of the fs suite tests look good.

Contributor left a comment

Looks like the umount hang is due to the MDS still being in replay, since the test case modified a config to slow down replay. The config needs to be reset, and the test should wait for the MDS to be back to active before ending (and the client unmounting).

@github-actions (bot) commented

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@vshankar (Contributor) commented Feb 8, 2024

@manishym I'll fix up the failures and push a change. Since I cannot push an update to a PR owned by you, I'll push a new PR with this change and the fixes on top. I'll preserve the contribution tags of course :)

@vshankar (Contributor) commented

Superseded by #55616.

@vshankar vshankar closed this Feb 16, 2024


6 participants