squid: mon,cephfs: require confirmation flag to bring down unhealthy MDS #57840

Merged
joscollin merged 11 commits into ceph:squid from rishabh-d-dave:wip-66330-squid on Jun 17, 2024

rishabh-d-dave commented Jun 3, 2024

When running the command "ceph mds fail" for an MDS that is unhealthy
due to MDS_CACHE_OVERSIZED or MDS_TRIM, the user must pass the
confirmation flag. Otherwise, the command fails and prints an
appropriate error message.

Restarting an MDS with such health warnings is not recommended, since
its slow recovery during restart will create new problems.

Fixes: https://tracker.ceph.com/issues/61866
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit eeda00e)
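
The rule described in this commit message can be modeled as a small decision function. This is a minimal sketch in Python, not the monitor's actual implementation: the function name `check_mds_fail`, the `-EPERM` return code, and the wording of the error message are assumptions made for illustration. Only the two health warning names and the flag "--yes-i-really-mean-it" come from this PR.

```python
import errno

# Health warnings named in the commit message above.
RISKY_WARNINGS = frozenset({"MDS_CACHE_OVERSIZED", "MDS_TRIM"})
CONFIRM_FLAG = "--yes-i-really-mean-it"


def check_mds_fail(health_warnings, confirmed):
    """Return (return_code, message) for a modeled "mds fail" request.

    The -EPERM return code and the message wording are assumptions made
    for this sketch, not values taken from the Ceph source.
    """
    risky = sorted(RISKY_WARNINGS.intersection(health_warnings))
    if risky and not confirmed:
        msg = ("MDS has health warning(s) {}; pass {} to fail it anyway"
               .format(", ".join(risky), CONFIRM_FLAG))
        return -errno.EPERM, msg
    # Healthy MDS, or the user explicitly confirmed: allow the request.
    return 0, ""


if __name__ == "__main__":
    print(check_mds_fail({"MDS_TRIM"}, confirmed=False))
    print(check_mds_fail({"MDS_TRIM"}, confirmed=True))
    print(check_mds_fail(set(), confirmed=False))
```

In this model, unrelated health warnings do not trigger the confirmation requirement; only the two warnings listed in the commit message do.
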
Update docs, since the command "ceph mds fail" now fails if the MDS has
either health warning MDS_TRIM or MDS_CACHE_OVERSIZED and the
confirmation flag is not passed.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit dea2220)
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit f241a3c)
Since the command "ceph mds fail" may now require the confirmation flag
("--yes-i-really-mean-it"), update filesystem.Filesystem.rank_fail() to
allow/disallow adding this flag to the command arguments.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 4f333e1)
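
One way a QA helper could make the flag optional is to build the argument list from a boolean parameter. The sketch below is hypothetical: `build_mds_fail_args` and its `confirm` parameter are illustrative names and do not reproduce the real `rank_fail()` signature; only the command "ceph mds fail" and the flag come from this PR.

```python
def build_mds_fail_args(mds_spec, confirm=True):
    """Build the argument list for "ceph mds fail".

    `mds_spec` identifies the MDS to fail. `confirm` controls whether the
    confirmation flag is appended; both names are hypothetical.
    """
    args = ["ceph", "mds", "fail", str(mds_spec)]
    if confirm:
        args.append("--yes-i-really-mean-it")
    return args


if __name__ == "__main__":
    print(build_mds_fail_args("a"))
    print(build_mds_fail_args("a", confirm=False))
```

Leaving the flag out (`confirm=False`) lets a test exercise the rejection path, while passing it keeps existing callers working against an unhealthy MDS.
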
The confirmation flag must be passed when running the command
"ceph fs fail" if an MDS for this FS has either of the two health
warnings: MDS_TRIM or MDS_CACHE_OVERSIZED. Otherwise, the command fails
and prints an appropriate error message.

Restarting an MDS with these health warnings is not recommended, since
its slow recovery during restart will create new problems.

Fixes: https://tracker.ceph.com/issues/61866
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit b901616)
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit de18c5a)
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 2481642)
Since the "ceph fs fail" command now requires the confirmation flag when
the Ceph cluster has either health warning MDS_TRIM or
MDS_CACHE_OVERSIZED, update the teardown in QA code. During teardown,
the CephFS should be failed regardless of whether the Ceph cluster has
health warnings.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit a1af1bf)
Add tests to verify that the confirmation flag is mandatory for running
the commands "ceph mds fail" and "ceph fs fail" when the MDS has one of
the two health warnings: MDS_CACHE_OVERSIZED or MDS_TRIM.

Also, add MDS_CACHE_OVERSIZED and MDS_TRIM to the ignorelist for
test_admin.py so that QA jobs know these warnings are expected.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 214d614)
@rishabh-d-dave rishabh-d-dave requested review from a team as code owners June 3, 2024 13:46
@rishabh-d-dave rishabh-d-dave added this to the squid milestone Jun 3, 2024
This issue was not caught in the original QA run because "ceph mds fail"
returns 0 even when the MDS name it receives as an argument does not
exist. This is done for the sake of idempotency; however, it caused
this bug to go uncaught.

Fixes: https://tracker.ceph.com/issues/65864
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit ab643f7)
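
The pitfall described in this commit message can be illustrated with a toy model. The sketch below is not Ceph code: `mds_fail` is a hypothetical stand-in that mimics only the idempotent "return 0 even for an unknown name" behaviour, and the MDS and FS names are made up.

```python
def mds_fail(active_mds, name):
    """Toy model: remove `name` from the set of active MDSs and return 0.

    Returns 0 even if `name` is unknown, mirroring the idempotent
    behaviour described above.
    """
    active_mds.discard(name)
    return 0


active = {"a", "b"}
before = set(active)

# Passing an FS name by mistake still "succeeds"...
rc = mds_fail(active, "cephfs")
# ...so only a state check reveals that nothing was actually failed.
nothing_failed = (active == before)

if __name__ == "__main__":
    print(rc, nothing_failed)
```

A test that checks only the return code passes in both cases, which is why asserting on the resulting cluster state is the more reliable check.
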
After running TestFSFail, CephFSTestCase.tearDown() fails attempting
to unmount CephFS. Set joinable on FS and wait for the MDS to be up
before exiting the test. This will ensure that unmounting is
successful in teardown.

Fixes: https://tracker.ceph.com/issues/65841
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit faa30e0)
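
Waiting for the MDS to be up before leaving the test is a poll-until-ready pattern. Below is a generic sketch of that pattern, assuming a caller-supplied predicate that reports whether the MDS is up. `wait_until` and its parameters are hypothetical names and are not taken from the Ceph QA framework.

```python
import time


def wait_until(predicate, timeout=30.0, interval=1.0):
    """Poll `predicate` until it returns True or `timeout` seconds pass.

    Returns True if the predicate succeeded, False on timeout. The
    predicate is always evaluated at least once.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In a test, the predicate would query the cluster for the MDS state, and the test would fail explicitly if `wait_until` returns False rather than proceeding to teardown.
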

rishabh-d-dave commented Jun 12, 2024

https://jenkins.ceph.com/job/ceph-api/75515/

Identified problems
	
No identified problem
No problems were identified. If you know why this problem occurred, please add a suitable Cause for it.

rishabh-d-dave commented

jenkins test api

joscollin commented

Tested in https://tracker.ceph.com/issues/66423

joscollin left a comment

rishabh-d-dave commented

@joscollin

@rishabh-d-dave There's a failure to be checked: https://pulpito.ceph.com/leonidus-2024-06-12_09:41:32-fs-wip-lusov-testing-20240611.123850-squid-distro-default-smithi/7751944/

The failing job is an fs:workload job, whereas MDS_CACHE_OVERSIZED is added here to fs:functional; AFAIS, this change has nothing to do with the fs:workload job.

@joscollin joscollin merged commit 3f51c89 into ceph:squid Jun 17, 2024
@rishabh-d-dave rishabh-d-dave deleted the wip-66330-squid branch June 19, 2024 10:41
joscollin pushed a commit to joscollin/ceph that referenced this pull request Jun 27, 2024
* refs/pull/57840/head:
	qa/cephfs: set joinable on FS before exiting tests in TestFSFail
	qa/cephfs: pass MDS name, not FS name, to "ceph mds fail" cmd
	qa/cephfs: add tests failing MDS and FS when MDS is unhealthy
	qa/cephfs: pass confirmation flag to fs fail in tear down code
	PendingReleaseNotes: note need of confirmation for "ceph fs fail"
	doc/cephfs: mention need of confirmation for "ceph fs fail"
	cephfs,mon: require confirmation to fail unhealthy FS
	qa/cephfs: update filesystem.Filesystem.rank_fail()
	PendingReleaseNotes: note need of confirmation for "ceph mds fail"
	doc/cephfs: mention need of confirmation for "ceph mds fail"
	cephfs,mon: require confirmation to fail unhealthy MDS

Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>