reef: mon,cephfs: require confirmation flag to bring down unhealthy MDS #57837

Merged
joscollin merged 11 commits into ceph:reef from rishabh-d-dave:wip-65927-reef
Jun 17, 2024
rishabh-d-dave commented Jun 3, 2024

When running the command "ceph mds fail" for an MDS that is unhealthy
due to MDS_CACHE_OVERSIZED or MDS_TRIM, the user must pass the
confirmation flag. Otherwise, the command will fail and print an
appropriate error message.

Restarting an MDS with such health warnings is not recommended since it
will have a slow recovery during restart, which will create new problems.

Fixes: https://tracker.ceph.com/issues/61866
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit eeda00e)
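
The gating rule described in the commit message above can be sketched as follows. This is a minimal illustration, not the monitor's actual code: the function name `check_mds_fail`, the use of `-EPERM` as the return code, and the wording of the error message are all assumptions made for this example.

```python
import errno

# The two health warnings named in the commit message above.
BLOCKING_WARNINGS = {"MDS_TRIM", "MDS_CACHE_OVERSIZED"}


def check_mds_fail(active_warnings, confirmed):
    """Hypothetical stand-in for the monitor-side check.

    Returns (retcode, message). The retcode and the message text are
    assumptions for illustration, not taken from the Ceph source.
    """
    blocking = BLOCKING_WARNINGS & set(active_warnings)
    if blocking and not confirmed:
        names = ", ".join(sorted(blocking))
        return (-errno.EPERM,
                f"MDS has health warning(s): {names}; "
                "pass --yes-i-really-mean-it to proceed")
    # Healthy MDS, or the user explicitly confirmed: allow the command.
    return 0, ""
```

A healthy MDS, or one with unrelated warnings, is unaffected by this check; only the two listed warnings make the flag mandatory.
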
Update docs since the command "ceph mds fail" will now fail if the MDS
has either health warning, MDS_TRIM or MDS_CACHE_OVERSIZED, and the
confirmation flag is not passed.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit dea2220)
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit f241a3c)
Since the command "ceph mds fail" may now require the confirmation flag
("--yes-i-really-mean-it"), update this method to allow/disallow adding
this flag to the command arguments.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 4f333e1)
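
A helper that optionally appends the confirmation flag might look like the sketch below. The function name `mds_fail_args` and its `confirm` parameter are hypothetical; the commit message does not name the QA method it changes, so this only illustrates the allow/disallow idea.

```python
def mds_fail_args(mds_id, confirm=False):
    """Hypothetical helper: build the argument list for "ceph mds fail".

    The confirmation flag is appended only when the caller asks for it,
    so tests can exercise both the guarded and the confirmed paths.
    """
    args = ["mds", "fail", str(mds_id)]
    if confirm:
        args.append("--yes-i-really-mean-it")
    return args
```

Defaulting `confirm` to `False` in this sketch keeps existing callers on the guarded path unless they opt in explicitly.
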
@rishabh-d-dave rishabh-d-dave requested review from a team as code owners June 3, 2024 13:37
@rishabh-d-dave rishabh-d-dave added this to the reef milestone Jun 3, 2024
The confirmation flag must be passed when running the command "ceph fs
fail" if the MDS for this FS has either of the two health warnings:
MDS_TRIM or MDS_CACHE_OVERSIZED. Otherwise, the command will fail and
print an appropriate error message.

Restarting an MDS with these health warnings is not recommended since it
will have a slow recovery during restart, which will create new problems.

Fixes: https://tracker.ceph.com/issues/61866
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit b901616)

Conflicts:
- src/mon/FSCommands.cc
  -  lines surrounding the patch are different in reef compared to main.
     the reef code was still accessing "mds_map" directly instead of
     accessing it using "get_mds_map()".
  - return value of get_filesystem() is different in main.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit de18c5a)
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 2481642)
Since the "ceph fs fail" command now requires the confirmation flag when
the Ceph cluster has either health warning, MDS_TRIM or
MDS_CACHE_OVERSIZED, update the teardown in QA code. During teardown,
the CephFS should be failed regardless of whether the Ceph cluster has
health warnings.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit a1af1bf)
Add tests to verify that the confirmation flag is mandatory for running
commands "ceph mds fail" and "ceph fs fail" when the MDS has one of the
two health warnings: MDS_CACHE_OVERSIZED or MDS_TRIM.

Also, add MDS_CACHE_OVERSIZED and MDS_TRIM to the ignorelist for
test_admin.py so that QA jobs know these are expected failures.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 214d614)
This issue was not caught in the original QA run because "ceph mds fail"
returns 0 even when the MDS name it receives as an argument does not
exist. This is done for the sake of idempotency; however, it caused this
bug to go uncaught.

Fixes: https://tracker.ceph.com/issues/65864
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit ab643f7)
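
The way an idempotent return code can hide a test bug can be sketched with a toy model. Both functions below are hypothetical: `fake_mds_fail` only mimics the "unknown name still returns 0" behaviour described above, and `strict_mds_fail` is one possible test-side guard, not necessarily the fix applied in this commit.

```python
def fake_mds_fail(known_mds, name):
    """Hypothetical model: failing an unknown MDS name is a silent no-op."""
    known_mds.discard(name)  # removes the name if present, else does nothing
    return 0                 # success is reported either way


def strict_mds_fail(known_mds, name):
    """Test-side guard (an assumption): verify the name before failing it."""
    if name not in known_mds:
        raise ValueError(f"no such MDS: {name}")
    return fake_mds_fail(known_mds, name)
```

A test that only asserts on the return code of `fake_mds_fail` passes even when it supplies the wrong name, which is the kind of gap the commit message describes.
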
After running TestFSFail, CephFSTestCase.tearDown() fails while
attempting to unmount CephFS. Set joinable on the FS and wait for the
MDS to be up before exiting the test. This ensures that unmounting
succeeds in teardown.

Fixes: https://tracker.ceph.com/issues/65841
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit faa30e0)
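
The "set joinable, then wait for the MDS" step amounts to a bounded polling loop. The sketch below shows only the waiting part; `wait_until` and its parameters are hypothetical names, and the predicate that reports whether an MDS is up is left to the caller. In a real cluster the FS would first be made joinable again, for example with `ceph fs set <fs_name> joinable true`.

```python
import time


def wait_until(predicate, timeout=60.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or the timeout expires.

    Returns True on success and False on timeout. The sleep and clock
    callables are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Returning `False` instead of raising leaves it to the test to decide whether a timeout should fail the run or only be logged.
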
@rishabh-d-dave

@leonid-s-usov make check passed.

@rishabh-d-dave

@leonid-s-usov

> @leonid-s-usov make check passed.

ping

@joscollin joscollin merged commit b464c77 into ceph:reef Jun 17, 2024
@joscollin

Tested in https://tracker.ceph.com/issues/66468

@rishabh-d-dave rishabh-d-dave deleted the wip-65927-reef branch June 19, 2024 04:36
@lxbsz

lxbsz commented Jul 4, 2024

@rishabh-d-dave This PR caused a QA test failure; it seems the dependency commit was not backported:

commit 29610577eece04c028c412f112a66fafa8f70316
Author: Venky Shankar <vshankar@redhat.com>
Date:   Mon Jul 24 00:33:47 2023 -0400

    mds: add mdlog trimming threshold and decay counter
    
    Fixes: http://tracker.ceph.com/issues/61908
    Signed-off-by: Venky Shankar <vshankar@redhat.com>

