Skip to content

mon,cephfs: require confirmation flag to bring down unhealthy MDS#56066

Merged
rishabh-d-dave merged 9 commits intoceph:mainfrom
rishabh-d-dave:mds-fail-confirm
May 3, 2024
Merged

mon,cephfs: require confirmation flag to bring down unhealthy MDS#56066
rishabh-d-dave merged 9 commits intoceph:mainfrom
rishabh-d-dave:mds-fail-confirm

Conversation

@rishabh-d-dave
Copy link
Contributor

@rishabh-d-dave rishabh-d-dave commented Mar 8, 2024

For an MDS that is unhealthy due to, MDS_HEALTH_CACHE_OVERSIZED or
MDS_HEALTH_TRIM, user must pass confirmation flag. When user doesn't
print fail the command and print appropriate error message.

Fixes: https://tracker.ceph.com/issues/61866

Checklist

  • Tracker (select at least one)
    • References tracker ticket
  • Component impact
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • No doc update is appropriate
  • Tests (select at least one)
    • No tests
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@rishabh-d-dave rishabh-d-dave requested a review from a team as a code owner March 8, 2024 15:42
@rishabh-d-dave rishabh-d-dave added the cephfs Ceph File System label Mar 8, 2024
@rishabh-d-dave rishabh-d-dave changed the title mon: require confirmation flag to bring down unhealthy MDS mon,cephfs: require confirmation flag to bring down unhealthy MDS Mar 8, 2024
Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will also want to:

  • block fs fail
  • add a unit test in qa/tasks/cephfs/ which synthetically forces the mds journal to not trim and induce the mds to throw MDS_HEALTH_TRIM warnings
  • ditto for MDS_HEALTH_CACHE_OVERSIZED
  • add pending release note
  • update documentation

@rishabh-d-dave
Copy link
Contributor Author

You will also want to:

* block `fs fail`

* add a unit test in qa/tasks/cephfs/ which synthetically forces the mds journal to not trim and induce the mds to throw `MDS_HEALTH_TRIM` warnings

* ditto for `MDS_HEALTH_CACHE_OVERSIZED`

* add pending release note

* update documentation

Yes. Thanks for the review, I wanted a confirmation from you or Venky before proceeding further. :)

@rishabh-d-dave rishabh-d-dave force-pushed the mds-fail-confirm branch 2 times, most recently from a79b137 to 7e4076e Compare March 12, 2024 19:49
@rishabh-d-dave rishabh-d-dave requested review from a team and batrick March 13, 2024 08:47
@rishabh-d-dave
Copy link
Contributor Author

rishabh-d-dave commented Mar 13, 2024

I have updated docs and added release notes and have tested this PR using the script I've written for other MDS PR. This PR is working fine. So, only adding tests is left.

I plan to write 4 tests. Test template for health warning X and test command Y fails without confirmation flag and then test that command Y fails with confirmation flag. We have 2 health warnings, MDS_HEALTH_TRIM and MDS_HEALTH_CACHE_OVERSIZED, and 2 commands, ceph mds fail and ceph fs fail, which makes us have 4 tests.

@rishabh-d-dave rishabh-d-dave force-pushed the mds-fail-confirm branch 3 times, most recently from 97212ea to 738bfa7 Compare March 13, 2024 09:52
@rishabh-d-dave rishabh-d-dave requested a review from a team as a code owner March 13, 2024 09:52
@github-actions github-actions bot added the tests label Mar 26, 2024
@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@rishabh-d-dave rishabh-d-dave force-pushed the mds-fail-confirm branch 4 times, most recently from 68d0af7 to 54d0d1e Compare April 7, 2024 12:02
@rishabh-d-dave rishabh-d-dave force-pushed the mds-fail-confirm branch 5 times, most recently from 519f7de to 09e5dd8 Compare April 24, 2024 07:25
@rishabh-d-dave
Copy link
Contributor Author

jenkins test make check

@rishabh-d-dave
Copy link
Contributor Author

Failures were unrelated - https://jenkins.ceph.com/job/ceph-pull-requests/133848/

Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please keep tests as simple as possible

@Svelar
Copy link
Member

Svelar commented Apr 25, 2024

jenkins test make check arm64

Add tests to verify that the confirmation flag is mandatory for running
commands "ceph mds fail" and "ceph fs fail" when MDS has one of the two
health warnings: MDS_CACHE_OVERSIZE or MDS_TRIM.

Also, add MDS_CACHE_OVERSIZE and MDS_TRIM to ignorelist for
test_admin.py so that QA jobs knows this an expected failure.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
@rishabh-d-dave
Copy link
Contributor Author

rishabh-d-dave commented Apr 25, 2024

It is running fine with teuthology as well as vstart_runner.py, picking this PR for QA.

rishabh-d-dave added a commit to rishabh-d-dave/ceph that referenced this pull request May 3, 2024
* refs/pull/56066/head:
	qa/cephfs: add tests failing MDS and FS when MDS is unhealthy
	qa/cephfs: pass confirmation flag to fs fail in tear down code
	PendingReleaseNotes: note need of confirmation for "ceph fs fail"
	doc/cephfs: mention need of confirmation for "ceph fs fail"
	cephfs,mon: require confirmation to fail unhealthy FS
	qa/cephfs: update filesystem.Filesystem.rank_fail()
	PendingReleaseNotes: note need of confirmation for "ceph mds fail"
	doc/cephfs: mention need of confirmation for "ceph mds fail"
	cephfs,mon: require confirmation to fail unhealthy MDS

Reviewed-by: Leonid Usov <leonid.usov@ibm.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Copy link
Contributor Author

@rishabh-d-dave rishabh-d-dave left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA run was successful - https://tracker.ceph.com/projects/cephfs/wiki/main#3-May-2024.

Testing took more time than expected because there were 25-30 new failures. Most of them caused by a PR in the testing branch but these were resolved on removing that PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants