Skip to content

quincy: mon,cephfs: require confirmation flag to bring down unhealthy MDS#57841

Merged
vshankar merged 11 commits intoceph:quincyfrom
rishabh-d-dave:wip-65928-quincy
Nov 28, 2024
Merged

quincy: mon,cephfs: require confirmation flag to bring down unhealthy MDS#57841
vshankar merged 11 commits intoceph:quincyfrom
rishabh-d-dave:wip-65928-quincy

Conversation

@rishabh-d-dave
Copy link
Contributor

@rishabh-d-dave rishabh-d-dave commented Jun 3, 2024

When running the command "ceph mds fail" for an MDS that is unhealthy
due to, MDS_CACHE_OVERSIZED or MDS_TRIM, user must pass confirmation
flag. Else, the command will fail and print an appropriate error
message.

Restarting an MDS with such health warnings is not recommended since it
will have a slow reocvery during restart which will create new problems.

Fixes: https://tracker.ceph.com/issues/61866
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit eeda00e)
Update docs since command "ceph mds fail" will now fail if MDS has either
health warning MDS_TRIM or MDS_CACHE_OVERSIZED and if confirmation flag
is not passed.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit dea2220)
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit f241a3c)
Since the command "ceph mds fail" now may require confirmation flag
("--yes-i-really-mean-it"), update this method to allow/disallow adding
this flag to the command arguments.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 4f333e1)
@rishabh-d-dave rishabh-d-dave requested review from a team as code owners June 3, 2024 13:53
@rishabh-d-dave rishabh-d-dave added this to the quincy milestone Jun 3, 2024
Confirmation flag must be passed when running the command "ceph fs fail"
when the MDS for this FS has either of the two health warnings: MDS_TRIM
or MDS_CACHE_OVERSIZED. Else, the command will fail and print an
appropriate error message.

Restarting an MDS with these health warnings is not recommened since it
will have a slow recovery during restart which will create new problems.

Fixes: https://tracker.ceph.com/issues/61866
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit b901616)

Conflicts:
- src/mon/FSCommands.cc
  Lines around the patch are different in quincy compared to main branch.
  "get_mds_map()" is not available in quincy branch, unlike main.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit de18c5a)
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 2481642)
Since "ceph fs fail" command now requires the confirmation flag when
Ceph cluster has either health warning MDS_TRIM or MDS_CACHE_OVERSIZE,
update tear down in QA code. During the teardown, the CephFS should be
failed, regardless of whether or not Ceph cluster has health warnings,
since it is teardown.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit a1af1bf)
Add tests to verify that the confirmation flag is mandatory for running
commands "ceph mds fail" and "ceph fs fail" when MDS has one of the two
health warnings: MDS_CACHE_OVERSIZE or MDS_TRIM.

Also, add MDS_CACHE_OVERSIZE and MDS_TRIM to ignorelist for
test_admin.py so that QA jobs knows this an expected failure.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 214d614)
This issue was not caught in original QA run because "ceph mds fail"
returns 0 even though MDS name received by it in argument is
non-existent. This is done for the sake of idempotency, however it
caused this bug to go uncaught.

Fixea: https://tracker.ceph.com/issues/65864
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit ab643f7)
After running TestFSFail, CephFSTestCase.tearDown() fails attempting
to unmount CephFS. Set joinable on FS and wait for the MDS to be up
before exiting the test. This will ensure that unmounting is
successful in teardown.

Fixes: https://tracker.ceph.com/issues/65841
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit faa30e0)
@joscollin joscollin requested a review from a team June 20, 2024 07:49
@joscollin
Copy link
Member

jenkins test windows

@joscollin
Copy link
Member

This PR is under test in https://tracker.ceph.com/issues/66597.

Copy link
Contributor Author

@rishabh-d-dave rishabh-d-dave left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ljflores ljflores requested a review from vshankar August 19, 2024 16:13
Copy link
Contributor

@rzarzynski rzarzynski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in src/mon/MonCommands.h looks good from the core's standpoint but leaving the approval to the @ceph/cephfs.

@vshankar
Copy link
Contributor

Found related failure - https://pulpito.ceph.com/xiubli-2024-08-01_12:01:33-fs-wip-xiubli-testing-20240801.015240-quincy-distro-default-smithi/7830262

I recall that there is a tracker for this, yes? Please link it here @rishabh-d-dave

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Nov 26, 2024
@vshankar vshankar removed the stale label Nov 26, 2024
Copy link
Contributor

@vshankar vshankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vshankar vshankar merged commit 2d9a4d6 into ceph:quincy Nov 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants