mds: fix rank 0 marked damaged if stopping fails after Elid flush. #65483
Merged
batrick
requested changes
Sep 11, 2025
Member
batrick
left a comment
Great bug report/reproducer, thanks!
src/mds/MDSRank.cc
Outdated
  // funny case: is our cache empty? no subtrees?
  if (!mdcache->is_subtrees()) {
-   if (whoami == 0) {
+   if (whoami == 0 && !mdlog->is_elid_only_journal()) {
Member
Suggested change
-   if (whoami == 0 && !mdlog->is_elid_only_journal()) {
+   if (whoami == 0 && mdlog->get_num_events() > 1) {
is sufficient I think.
Contributor
Author
It works and is more concise. I've applied the suggested change and added comments. Thanks!
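The guard agreed on in this thread can be sketched as a standalone function. This is a simplified model under stated assumptions, not the actual `MDSRank` code: `NoSubtreeAction` and `on_no_subtrees` are hypothetical names, `num_events` stands in for `mdlog->get_num_events()`, and the function only models the branch taken when the cache has no subtrees after replay.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical outcomes of the "no subtrees after replay" branch.
enum class NoSubtreeAction { MarkDamaged, RequestStopped };

// Simplified model of the reviewed condition. `whoami` is the MDS rank and
// `num_events` stands in for mdlog->get_num_events(). Call this only when
// the cache has no subtrees.
inline NoSubtreeAction on_no_subtrees(int whoami, std::size_t num_events) {
  assert(whoami >= 0);
  // Rank 0 with more than one journal event but no subtrees is treated as damage.
  if (whoami == 0 && num_events > 1) {
    return NoSubtreeAction::MarkDamaged;
  }
  // Other ranks, and rank 0 whose journal holds a single event, stop cleanly.
  return NoSubtreeAction::RequestStopped;
}
```

With this model, `on_no_subtrees(0, 1)` yields `RequestStopped`, which is the case the PR is meant to fix, while rank 0 with a longer journal and no subtrees is still reported as damaged.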
… log trimmed

Steps to reproduce:
../src/vstart.sh --debug --new -x --localhost --bluestore
./bin/ceph tell mds.<rank 0> config set mds_kill_shutdown_at 10
./bin/ceph fs set <fs name> down true

Wait a few seconds; the take-over MDS logs the following and rank 0 is marked damaged:
2025-09-11T16:47:24.591+0800 785dabeaa6c0 -1 log_channel(cluster) log [ERR] : No subtrees found for root MDS rank!
2025-09-11T16:47:24.591+0800 785dabeaa6c0 5 mds.beacon.b set_want_state: up:rejoin -> down:damaged

During shutdown_pass, after submitting ELid and trimming the mdlog, the MDS log contains only the ELid event, which does nothing at replay. After replay, no subtree is found.

Fix this by checking whether MDLog contains only one event. If so, skip the subtree check for rank 0, and allow it to request STATE_STOPPED just like the other ranks.

Fixes: https://tracker.ceph.com/issues/72983
Signed-off-by: ethanwu <ethanwu@synology.com>
65e09c7 to adb448b
batrick
approved these changes
Sep 12, 2025
vshankar
approved these changes
Sep 16, 2025
Contributor
This PR is under test in https://tracker.ceph.com/issues/73082.
vshankar added a commit to vshankar/ceph that referenced this pull request Sep 26, 2025
* refs/pull/65483/head: Reviewed-by: Venky Shankar <vshankar@redhat.com> Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Contributor
This is good to merge 👍
vshankar
approved these changes
Sep 29, 2025
This was referenced Oct 4, 2025
Steps to reproduce:
../src/vstart.sh --debug --new -x --localhost --bluestore
./bin/ceph tell mds.<rank 0> config set mds_kill_shutdown_at 10
./bin/ceph fs set <fs name> down true
Wait a few seconds; the take-over MDS logs the following and rank 0 is marked damaged:
2025-09-11T16:47:24.591+0800 785dabeaa6c0 -1 log_channel(cluster) log [ERR] : No subtrees found for root MDS rank!
2025-09-11T16:47:24.591+0800 785dabeaa6c0 5 mds.beacon.b set_want_state: up:rejoin -> down:damaged
During shutdown_pass, after the ELid event is submitted and the mdlog is trimmed, the MDS log contains only the ELid event, which does nothing at replay. As a result, no subtree is found after replay.
Fix this by checking whether MDLog contains only the ELid event. If so, skip the subtree check for rank 0 and let it request STATE_STOPPED just as the other ranks do.
Fixes: https://tracker.ceph.com/issues/72983
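To illustrate why replaying the trimmed journal leaves rank 0 without subtrees, here is a toy model, not Ceph code. All names are hypothetical: `Event::Lid` models the ELid event as a replay no-op, `Event::SubtreeMap` is an assumed stand-in for any event that restores a subtree, and the two predicates contrast the check before and after the fix, using the journal length as the "only one event" test.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy journal event kinds (hypothetical). Lid models ELid, which does nothing
// at replay; SubtreeMap stands in for any event that restores a subtree.
enum class Event { Lid, SubtreeMap };

// Replay the toy journal and return the number of subtrees in the toy cache.
inline std::size_t replay_subtrees(const std::vector<Event>& journal) {
  std::size_t subtrees = 0;
  for (Event e : journal) {
    if (e == Event::SubtreeMap) ++subtrees;  // Lid contributes nothing
  }
  return subtrees;
}

// Old check: rank 0 with no subtrees after replay is always marked damaged.
inline bool damaged_before_fix(int whoami, const std::vector<Event>& journal) {
  assert(whoami >= 0);
  return whoami == 0 && replay_subtrees(journal) == 0;
}

// New check: a journal holding a single event is exempt for rank 0.
inline bool damaged_after_fix(int whoami, const std::vector<Event>& journal) {
  assert(whoami >= 0);
  return whoami == 0 && replay_subtrees(journal) == 0 && journal.size() > 1;
}
```

For a journal containing only `Event::Lid`, the old predicate reports damage and the new one does not; a longer journal that still yields no subtrees is flagged by both.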