mon,cephfs: require confirmation flag to bring down unhealthy MDS #56066
rishabh-d-dave merged 9 commits into ceph:main
batrick left a comment
You will also want to:
- block fs fail
- add a unit test in qa/tasks/cephfs/ which synthetically forces the mds journal to not trim and induce the mds to throw MDS_HEALTH_TRIM warnings
- ditto for MDS_HEALTH_CACHE_OVERSIZED
- add pending release note
- update documentation
Yes. Thanks for the review, I wanted a confirmation from you or Venky before proceeding further. :)
I have updated the docs, added release notes, and tested this PR using the script I wrote for another MDS PR. This PR is working fine, so only adding tests is left. I plan to write 4 tests. The test template is: for health warning X, test that command Y fails without the confirmation flag, and then test that command Y succeeds with the confirmation flag. We have 2 health warnings,
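The plan above (two warnings, two commands, four tests) can be sketched as a small matrix. This is only an illustration under assumptions: the warning names are the ones the reviewer listed earlier in this thread, `fake_run` is a hypothetical stand-in for issuing the command against a cluster, and the second step of the template is read as "the command goes through once the flag is passed". It is not the test code added by this PR.

```python
from itertools import product

# Warning names as listed by the reviewer earlier in the thread.
WARNINGS = ("MDS_HEALTH_TRIM", "MDS_HEALTH_CACHE_OVERSIZED")
# The two commands the confirmation flag applies to.
COMMANDS = ("mds fail", "fs fail")


def fake_run(command, active_warning, confirmed):
    """Hypothetical stand-in for a cluster: True means the command was accepted."""
    return confirmed or active_warning is None


def check_case(warning, command, run=fake_run):
    # Step 1: without the confirmation flag the command must be rejected.
    assert not run(command, warning, confirmed=False)
    # Step 2: with the confirmation flag the same command must go through.
    assert run(command, warning, confirmed=True)
    return True


# One case per (warning, command) pair: 2 x 2 = 4 tests.
CASES = list(product(WARNINGS, COMMANDS))
RESULTS = {case: check_case(*case) for case in CASES}
```

Keeping each case to the same two assertions means the four tests differ only in which warning is induced and which command is issued.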
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
jenkins test make check
Failures were unrelated - https://jenkins.ceph.com/job/ceph-pull-requests/133848/
batrick left a comment
please keep tests as simple as possible
jenkins test make check arm64
Add tests to verify that the confirmation flag is mandatory for running the commands "ceph mds fail" and "ceph fs fail" when an MDS has one of the two health warnings: MDS_CACHE_OVERSIZE or MDS_TRIM. Also, add MDS_CACHE_OVERSIZE and MDS_TRIM to the ignorelist for test_admin.py so that QA jobs know this is an expected failure.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
It is running fine with teuthology as well as vstart_runner.py; picking this PR for QA.
* refs/pull/56066/head:
  qa/cephfs: add tests failing MDS and FS when MDS is unhealthy
  qa/cephfs: pass confirmation flag to fs fail in tear down code
  PendingReleaseNotes: note need of confirmation for "ceph fs fail"
  doc/cephfs: mention need of confirmation for "ceph fs fail"
  cephfs,mon: require confirmation to fail unhealthy FS
  qa/cephfs: update filesystem.Filesystem.rank_fail()
  PendingReleaseNotes: note need of confirmation for "ceph mds fail"
  doc/cephfs: mention need of confirmation for "ceph mds fail"
  cephfs,mon: require confirmation to fail unhealthy MDS
Reviewed-by: Leonid Usov <leonid.usov@ibm.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
QA run was successful - https://tracker.ceph.com/projects/cephfs/wiki/main#3-May-2024.
Testing took more time than expected because there were 25-30 new failures. Most of them were caused by a PR in the testing branch, and they were resolved by removing that PR.
Requested changes were incorporated.
For an MDS that is unhealthy due to MDS_HEALTH_CACHE_OVERSIZED or
MDS_HEALTH_TRIM, the user must pass a confirmation flag. When the user
doesn't, fail the command and print an appropriate error message.
Fixes: https://tracker.ceph.com/issues/61866
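The rule in the description can be illustrated with a minimal Python sketch. It only models the decision (guarded warning present, no confirmation given, so reject with a message); the `Mds` class, `check_fail_request` and the error wording are hypothetical names chosen for this example and are not taken from the monitor code changed by this PR.

```python
from dataclasses import dataclass, field

# Warnings the PR description names as requiring confirmation.
GUARDED_WARNINGS = frozenset({"MDS_HEALTH_CACHE_OVERSIZED", "MDS_HEALTH_TRIM"})


@dataclass
class Mds:
    """Hypothetical, minimal view of an MDS: a name plus its active warnings."""
    name: str
    warnings: set = field(default_factory=set)


def check_fail_request(mds, confirmed):
    """Return (allowed, message) for a request to fail `mds`."""
    # Only the guarded warnings matter; other warnings do not block the command.
    blocking = sorted(GUARDED_WARNINGS & mds.warnings)
    if blocking and not confirmed:
        return False, (f"MDS {mds.name} has health warning(s) "
                       f"{', '.join(blocking)}; pass the confirmation flag "
                       "to fail it anyway")
    return True, ""
```

A healthy MDS, or one carrying only other warnings, is unaffected in this sketch; the confirmation is consulted only when one of the two guarded warnings is active.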