mds: fix bug and error handling of mds STATE_STARTING #56429
Conversation
…TARTING Just like STATE_CREATING, an mds can fail or be stopped anywhere in the STATE_STARTING state, so make sure a subsequent take-over mds restarts from STATE_STARTING. Otherwise we end up with an empty journal (no ESubtreeMap): the take-over mds fails with no subtrees found and the rank is marked damaged.
Quick way to reproduce this:
./bin/ceph fs set a down true    # take down all ranks in filesystem a
# wait for the fs to stop all ranks
./bin/ceph fs set a down false; pidof ceph-mds | xargs kill    # quickly kill all mds soon after they enter the starting state
./bin/ceph-mds -i a -c ./ceph.conf    # start all mds
Then we'll find the mds rank reported damaged with the following log:
-1 log_channel(cluster) log [ERR] : No subtrees found for root MDS rank!
 5 mds.beacon.a set_want_state: up:rejoin -> down:damaged
Fixes: https://tracker.ceph.com/issues/65094 Signed-off-by: ethanwu <ethanwu@synology.com>
ee95e41 to 0b79f99
If we don't flush the mds log before requesting STATE_ACTIVE, and the mds happens to stop before the log reaches the journal, the take-over mds will have no ESubtreeMap to replay and will fail later at the non-empty subtree check. Fixes: https://tracker.ceph.com/issues/65094 Signed-off-by: ethanwu <ethanwu@synology.com>
…starting
The root ino belongs to the subtree of the root rank and should be inserted when
creating the subtree map log entry. However, this is missing when the mds runs at
STATE_STARTING. During replay, all inodes under this subtree are trimmed by
trim_non_auth_subtree, causing replay failure.
Quick way to reproduce this:
After creating the filesystem, mount it and create some directories.
mkdir -p ${cephfs_root}/dir1/dir11/foo
mkdir -p ${cephfs_root}/dir1/dir11/bar
unmount cephfs
./bin/ceph fs set a down true
./bin/ceph fs set a down false
./bin/cephfs-journal-tool --rank=a:0 event get json --path output # Can see that ESubtreeMap only contains 0x100 but no 0x1
mount cephfs
rmdir ${cephfs_root}/dir1/dir11/foo
rmdir ${cephfs_root}/dir1/dir11/bar
unmount cephfs
Trigger an mds rank 0 failover; rank 0 then fails during replay and is marked damaged.
Checking the mds log shows the following related messages:
-49> 2024-03-24T18:06:19.461+0800 7f1542cbf700 10 mds.0.cache trim_non_auth_subtree(0x560372b2df80) [dir 0x1 / [2,head] auth v=12 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2024-03-24T18:03:30.350260+0800 1=0+1) n(v3 rc2024-03-24T18:03:30.401819+0800 4=0+4) hs=1+0,ss=0+0 | child=1 subtree=1 0x560372b2df80]
-27> 2024-03-24T18:06:19.461+0800 7f1542cbf700 14 mds.0.cache remove_inode [inode 0x10000000000 [...2,head] #10000000000/ auth v10 f(v0 m2024-03-24T18:03:30.378677+0800 1=0+1) n(v1 rc2024-03-24T18:03:30.401819+0800 4=0+4) (iversion lock) 0x560372c52100]
-21> 2024-03-24T18:06:19.461+0800 7f1542cbf700 10 mds.0.log _replay 4216759~3161 / 4226491 2024-03-24T18:05:16.515314+0800: EUpdate unlink_local [metablob 0x10000000000, 4 dirs]
-20> 2024-03-24T18:06:19.461+0800 7f1542cbf700 10 mds.0.journal EUpdate::replay
-19> 2024-03-24T18:06:19.461+0800 7f1542cbf700 10 mds.0.journal EMetaBlob.replay 4 dirlumps by unknown.0
-18> 2024-03-24T18:06:19.461+0800 7f1542cbf700 10 mds.0.journal EMetaBlob.replay don't have renamed ino 0x10000000003
-17> 2024-03-24T18:06:19.461+0800 7f1542cbf700 10 mds.0.journal EMetaBlob.replay found null dentry in dir 0x10000000001
-16> 2024-03-24T18:06:19.461+0800 7f1542cbf700 10 mds.0.journal EMetaBlob.replay dir 0x10000000000
-15> 2024-03-24T18:06:19.461+0800 7f1542cbf700 0 mds.0.journal EMetaBlob.replay missing dir ino 0x10000000000
-14> 2024-03-24T18:06:19.461+0800 7f1542cbf700 -1 log_channel(cluster) log [ERR] : failure replaying journal (EMetaBlob)
-13> 2024-03-24T18:06:19.461+0800 7f1542cbf700 5 mds.beacon.c set_want_state: up:replay -> down:damaged
The fix follows how the mdsdir inode is handled when the MDS enters STARTING.
Fixes: https://tracker.ceph.com/issues/65094
Signed-off-by: ethanwu <ethanwu@synology.com>
0b79f99 to 463c3b7
// the rank ever existed so that next time it's handed out
// to a gid it'll go back into CREATING.
fs.mds_map.in.erase(info.rank);
} else if (info.state == MDSMap::STATE_STARTING) {
This seems to be an ideal place to introduce kill points. @batrick WDYT?
I think you mean in MDSRank during up:starting? Probably a good idea, yes. The MDSThrasher could set various killpoints during workload testing too.
mdlog->submit_entry(sle);

// sync snaptable cache
snapclient->sync(new C_MDSInternalNoop);
While you're here, we should also require that the snapclient is synced before entering active. Please also move this and use MDSGatherBuilder to trigger ::starting_done.
I leave it here because the order is the same as in creating_done. Should that be fixed as well, or does only the starting state have to sync the snap cache before requesting active?
It seems that snapclient sync cannot be called before requesting active. It doesn't work for a rank that also runs as the tableserver: snapclient->sync won't finish until MDSTableClient::server_ready is true.
Yet server_ready is set only when the state of the tableserver's rank is > STATE_RESOLVE, and both STATE_STARTING and STATE_CREATING are < STATE_RESOLVE, so the gather won't finish.
I'm not familiar with the snap client, but what snapclient->sync does seems to be querying state from the snap server and caching the result. If the mds stops after requesting STATE_ACTIVE, the take-over mds will start from REPLAY->RESOLVE, and in MDSRank::resolve_done it will call snapclient->sync to reconstruct the snapclient cache again. I think it's OK to leave snapclient->sync unchanged; it looks like it has nothing to do with the STARTING and CREATING states.
I agree with @ethanwu-syno to leave this bit as it is. @batrick WDYT?
You're right, let's not block this due to this.
jenkins test make check arm64
This PR is under test in https://tracker.ceph.com/issues/66327.
jenkins test make check arm64
The tracker mentions the way this could be reproduced; won't it be a good idea to have a test case then?
Good question. IIRC, this was discussed in some comment(?) and reproducing this involved timing. |
This PR is under test in https://tracker.ceph.com/issues/66522.
Fixes: https://tracker.ceph.com/issues/65094