quincy: mon,cephfs: require confirmation flag to bring down unhealthy MDS#57841
vshankar merged 11 commits into ceph:quincy
Conversation
When running the command "ceph mds fail" for an MDS that is unhealthy due to MDS_CACHE_OVERSIZED or MDS_TRIM, the user must pass the confirmation flag. Otherwise, the command will fail and print an appropriate error message. Restarting an MDS with such health warnings is not recommended since it will have a slow recovery during restart, which will create new problems. Fixes: https://tracker.ceph.com/issues/61866 Signed-off-by: Rishabh Dave <ridave@redhat.com> (cherry picked from commit eeda00e)
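The gating behaviour described above can be sketched in Python. This is a hypothetical illustration only: the real check lives in Ceph's C++ monitor code (the files this PR touches under src/mon/), and the function and variable names below are invented for the sketch, not Ceph APIs. The flag name `--yes-i-really-mean-it` is the one this PR uses.

```python
# Hypothetical sketch of the "ceph mds fail" confirmation gate added by
# this PR. Not Ceph code; names are invented for illustration.
BLOCKING_WARNINGS = {"MDS_TRIM", "MDS_CACHE_OVERSIZED"}

def check_mds_fail(health_warnings, yes_i_really_mean_it=False):
    """Return (retval, message) mimicking the command's gating logic."""
    blocking = BLOCKING_WARNINGS & set(health_warnings)
    if blocking and not yes_i_really_mean_it:
        # Refuse: restarting an MDS in this state risks a slow recovery.
        return (1, "MDS has health warning(s) %s; pass "
                   "--yes-i-really-mean-it to proceed" % sorted(blocking))
    return (0, "mds marked failed")

# Without the flag the command is refused; with it, it proceeds.
print(check_mds_fail(["MDS_TRIM"]))
print(check_mds_fail(["MDS_TRIM"], yes_i_really_mean_it=True))
```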
Update docs since the command "ceph mds fail" will now fail if the MDS has either health warning, MDS_TRIM or MDS_CACHE_OVERSIZED, and the confirmation flag is not passed. Signed-off-by: Rishabh Dave <ridave@redhat.com> (cherry picked from commit dea2220)
Signed-off-by: Rishabh Dave <ridave@redhat.com> (cherry picked from commit f241a3c)
Since the command "ceph mds fail" now may require confirmation flag
("--yes-i-really-mean-it"), update this method to allow/disallow adding
this flag to the command arguments.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 4f333e1)
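A minimal sketch of the kind of QA helper change described above: a wrapper that can optionally append the confirmation flag to the command arguments. The helper name and signature here are hypothetical, not the actual method modified in the QA code.

```python
# Hypothetical sketch of a QA helper that can add/omit the
# --yes-i-really-mean-it flag; the real helper in qa/ differs.
def build_mds_fail_cmd(mds_name, confirm=True):
    args = ["mds", "fail", mds_name]
    if confirm:
        # Needed when the MDS has MDS_TRIM or MDS_CACHE_OVERSIZED.
        args.append("--yes-i-really-mean-it")
    return args
```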
The confirmation flag must be passed when running the command "ceph fs fail" when the MDS for this FS has either of the two health warnings: MDS_TRIM or MDS_CACHE_OVERSIZED. Otherwise, the command will fail and print an appropriate error message. Restarting an MDS with these health warnings is not recommended since it will have a slow recovery during restart, which will create new problems. Fixes: https://tracker.ceph.com/issues/61866 Signed-off-by: Rishabh Dave <ridave@redhat.com> (cherry picked from commit b901616) Conflicts: - src/mon/FSCommands.cc Lines around the patch are different in quincy compared to the main branch. "get_mds_map()" is not available in the quincy branch, unlike main.
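The "ceph fs fail" variant of the gate can be sketched similarly: since failing the FS affects every MDS serving it, the refusal triggers if any of them carries a blocking warning. Again, this is an invented illustration of the behaviour, not the C++ code in src/mon/FSCommands.cc.

```python
# Hypothetical sketch of the "ceph fs fail" gate: refuse if any MDS of
# the FS has a blocking health warning and no confirmation was given.
BLOCKING = {"MDS_TRIM", "MDS_CACHE_OVERSIZED"}

def check_fs_fail(mds_warnings_by_name, confirmed=False):
    """mds_warnings_by_name maps MDS name -> list of health warnings."""
    offenders = sorted(name for name, warns in mds_warnings_by_name.items()
                       if BLOCKING & set(warns))
    if offenders and not confirmed:
        return 1  # refuse; user must pass --yes-i-really-mean-it
    return 0
```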
Signed-off-by: Rishabh Dave <ridave@redhat.com> (cherry picked from commit de18c5a)
Signed-off-by: Rishabh Dave <ridave@redhat.com> (cherry picked from commit 2481642)
Since the "ceph fs fail" command now requires the confirmation flag when the Ceph cluster has either health warning, MDS_TRIM or MDS_CACHE_OVERSIZED, update the teardown in the QA code. During the teardown, the CephFS should be failed regardless of whether or not the Ceph cluster has health warnings, since it is a teardown. Signed-off-by: Rishabh Dave <ridave@redhat.com> (cherry picked from commit a1af1bf)
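In teardown, the flag is therefore passed unconditionally. A hypothetical sketch of that change (the real teardown code in qa/ is more involved):

```python
# Hypothetical sketch: QA teardown always confirms, so "fs fail"
# succeeds even when MDS_TRIM / MDS_CACHE_OVERSIZED is present.
def teardown_fs_fail_args(fs_name):
    return ["fs", "fail", fs_name, "--yes-i-really-mean-it"]
```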
Force-pushed from 76baeb0 to f8e910b.
Add tests to verify that the confirmation flag is mandatory for running the commands "ceph mds fail" and "ceph fs fail" when the MDS has one of the two health warnings: MDS_CACHE_OVERSIZED or MDS_TRIM. Also, add MDS_CACHE_OVERSIZED and MDS_TRIM to the ignorelist for test_admin.py so that QA jobs know this is an expected failure. Signed-off-by: Rishabh Dave <ridave@redhat.com> (cherry picked from commit 214d614)
This issue was not caught in the original QA run because "ceph mds fail" returns 0 even when the MDS name it receives as an argument is non-existent. This is done for the sake of idempotency; however, it caused this bug to go uncaught. Fixes: https://tracker.ceph.com/issues/65864 Signed-off-by: Rishabh Dave <ridave@redhat.com> (cherry picked from commit ab643f7)
After running TestFSFail, CephFSTestCase.tearDown() fails while attempting to unmount the CephFS. Set the joinable flag on the FS and wait for the MDS to be up before exiting the test. This will ensure that unmounting is successful in teardown. Fixes: https://tracker.ceph.com/issues/65841 Signed-off-by: Rishabh Dave <ridave@redhat.com> (cherry picked from commit faa30e0)
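The "wait for the MDS to be up" step above amounts to polling the MDS state until it becomes active. A hypothetical, self-contained sketch of such a wait loop (the QA suite has its own wait helpers; this is not their API):

```python
# Hypothetical sketch of waiting for an MDS to become active before a
# test exits, so tearDown's unmount succeeds. Not the actual QA helper.
import time

def wait_for_mds_active(get_state, timeout=60, interval=1):
    """Poll get_state() until it returns 'active' or timeout expires."""
    elapsed = 0
    while elapsed < timeout:
        if get_state() == "active":
            return True
        time.sleep(interval)
        elapsed += interval
    return False
```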
Force-pushed from f8e910b to 973ff56.
jenkins test windows
This PR is under test in https://tracker.ceph.com/issues/66597.
rzarzynski
left a comment
The changes in src/mon/MonCommands.h look good from the core's standpoint, but I'm leaving the approval to @ceph/cephfs.
I recall that there is a tracker for this, yes? Please link it here @rishabh-d-dave
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
Backports 3 related trackers together:
backport tracker: https://tracker.ceph.com/issues/65928
backport tracker: https://tracker.ceph.com/issues/66199
backport tracker: https://tracker.ceph.com/issues/66410
this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/master/src/script/ceph-backport.sh