mds: fix rank 0 marked damaged if stopping fails after Elid flush.#65483

Merged
vshankar merged 1 commit into ceph:main from ethanwu-syno:fix-rank0-stopping
Sep 29, 2025

@ethanwu-syno ethanwu-syno commented Sep 11, 2025

Steps to reproduce:
../src/vstart.sh --debug --new -x --localhost --bluestore
./bin/ceph tell mds.<rank 0> config set mds_kill_shutdown_at 10
./bin/ceph fs set <fs name> down true

Wait a few seconds. The take-over MDS then logs the following, and rank 0 is marked damaged:
2025-09-11T16:47:24.591+0800 785dabeaa6c0 -1 log_channel(cluster) log [ERR] : No subtrees found for root MDS rank!
2025-09-11T16:47:24.591+0800 785dabeaa6c0 5 mds.beacon.b set_want_state: up:rejoin -> down:damaged

During shutdown_pass, after the ELid event is submitted and the mdlog is trimmed, the MDS log contains only the ELid event, which does nothing at replay. As a result, no subtree is found after replay.

Fix this by checking whether the MDLog contains only the ELid event. If so, skip the subtree check for rank 0 and let it request STATE_STOPPED, just as the other ranks do.
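
To make the failure mode concrete, here is a minimal toy model of the replay step described above. It is only a sketch: `EventType`, `Event`, and `replay` are hypothetical names invented for illustration and are not Ceph's real journal types. The only behavior taken from the description is that the ELid event does nothing at replay.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for journal events (not Ceph's real types).
enum class EventType { Lid, SubtreeMap };

struct Event {
  EventType type;
  std::string subtree_root;  // only meaningful for SubtreeMap events
};

// Replays a journal and returns the subtree roots that replay discovered.
// A Lid event is treated as a no-op, mirroring the statement that ELid
// "doesn't do anything at replay".
std::set<std::string> replay(const std::vector<Event>& journal) {
  std::set<std::string> subtrees;
  for (const Event& ev : journal) {
    if (ev.type == EventType::SubtreeMap) {
      assert(!ev.subtree_root.empty());  // toy invariant for this sketch
      subtrees.insert(ev.subtree_root);
    }
  }
  return subtrees;
}
```

In this model, a journal that holds only a `Lid` event replays to an empty subtree set, which is the state in which the take-over MDS for rank 0 reported "No subtrees found for root MDS rank!".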

Fixes: https://tracker.ceph.com/issues/72983

@github-actions github-actions bot added the cephfs Ceph File System label Sep 11, 2025

@batrick batrick left a comment

Great bug report/reproducer, thanks!

  // funny case: is our cache empty? no subtrees?
  if (!mdcache->is_subtrees()) {
-   if (whoami == 0) {
+   if (whoami == 0 && !mdlog->is_elid_only_journal()) {

Suggested change
- if (whoami == 0 && !mdlog->is_elid_only_journal()) {
+ if (whoami == 0 && mdlog->get_num_events() > 1) {

is sufficient I think.

Contributor Author

It works and is more concise. I applied the suggested change and added comments. Thanks!
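
The check agreed on in this thread can be sketched as a small standalone decision function. This is an illustration under stated assumptions, not the code in the MDS: `NoSubtreeAction` and `on_no_subtrees` are hypothetical names, and the function assumes it is called only in the case where the cache has no subtrees. The condition `whoami == 0 && num_events > 1` is taken from the suggested change.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical outcome of the "no subtrees" case (not a Ceph enum).
enum class NoSubtreeAction { MarkDamaged, RequestStopped };

// Toy decision for the case where no subtrees were found.
// num_events stands in for the value the suggestion reads via
// mdlog->get_num_events().
NoSubtreeAction on_no_subtrees(int whoami, std::uint64_t num_events) {
  assert(whoami >= 0);  // ranks are non-negative in this sketch
  if (whoami == 0 && num_events > 1) {
    // Rank 0 with a non-trivial journal but no subtrees: treat as damage.
    return NoSubtreeAction::MarkDamaged;
  }
  // Other ranks, or rank 0 with a journal of at most one event.
  return NoSubtreeAction::RequestStopped;
}
```

With this shape, rank 0 with a single-event journal takes the same path as the other ranks, while rank 0 with more than one event and no subtrees is still reported as damaged.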

… log trimmed

Steps to reproduce:
 ../src/vstart.sh --debug --new -x --localhost --bluestore
 ./bin/ceph tell mds.<rank 0> config set mds_kill_shutdown_at 10
 ./bin/ceph fs set <fs name> down true

Wait a few seconds; the take-over MDS logs the following and rank 0
is marked damaged:
2025-09-11T16:47:24.591+0800 785dabeaa6c0 -1 log_channel(cluster) log [ERR] : No subtrees found for root MDS rank!
2025-09-11T16:47:24.591+0800 785dabeaa6c0 5 mds.beacon.b set_want_state: up:rejoin -> down:damaged

During shutdown_pass, after submitting ELid and trimming the mdlog, the
MDS log contains only the ELid event, which does nothing at replay.
After replay, no subtree is found.

Fix this by checking whether MDLog contains only one event.
If so, skip the subtree check for rank 0, and allow it to request
STATE_STOPPED just like the other ranks.

Fixes: https://tracker.ceph.com/issues/72983
Signed-off-by: ethanwu <ethanwu@synology.com>
@vshankar

This PR is under test in https://tracker.ceph.com/issues/73082.

vshankar added a commit to vshankar/ceph that referenced this pull request Sep 26, 2025
* refs/pull/65483/head:

Reviewed-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
@vshankar

This is good to merge 👍

@vshankar vshankar left a comment

@vshankar vshankar merged commit 9b3bfb6 into ceph:main Sep 29, 2025
13 checks passed

Labels

cephfs Ceph File System needs-qa
