Bug #65265


qa: health warning "no active mgr (MGR_DOWN)" occurs before and after test_nfs runs

Added by Rishabh Dave almost 2 years ago. Updated 5 months ago.

Status:
Resolved
Priority:
Urgent
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Backport:
quincy,reef,squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
mgr/nfs
Labels (FS):
Pull request ID:
Tags (freeform):
Fixed In:
v19.3.0-2212-g999ca78a1a
Released In:
v20.2.0~2885
Upkeep Timestamp:
2025-11-01T01:12:49+00:00

Description

Link to the job - https://pulpito.ceph.com/rishabh-2024-03-27_05:27:11-fs-wip-rishabh-testing-20240326.131558-testing-default-smithi/7625569/

The tests (qa/tasks/cephfs/test_nfs.py) ran successfully but the job failed due to the unexpected health warnings -

2024-03-27T06:38:24.458 INFO:teuthology.orchestra.run.smithi184.stdout:2024-03-27T06:07:34.219228+0000 mon.a (mon.0) 323 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)

This health warning occurred 4 times in total: 2 times before test_nfs.py started running, 2 times after test_nfs.py finished running, and never while test_nfs.py was running.

Warning 1, line 11268 - 2024-03-27T06:07:34.833 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:34 smithi184 bash[21504]: cluster 2024-03-27T06:07:34.219228+0000 mon.a (mon.0) 323 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
Warning 2, line 23642 - 2024-03-27T06:07:49.277 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:48 smithi184 bash[21504]: cluster 2024-03-27T06:07:47.832342+0000 mon.a (mon.0) 342 : cluster [INF] Health check cleared: MGR_DOWN (was: no active mgr)
Then the cluster becomes healthy, from line 23643 -

2024-03-27T06:07:49.277 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:48 smithi184 bash[21504]: cluster 2024-03-27T06:07:47.832393+0000 mon.a (mon.0) 343 : cluster [INF] Cluster is now healthy
2024-03-27T06:07:49.277 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:48 smithi184 bash[21504]: cluster 2024-03-27T06:07:47.836380+0000 mon.a (mon.0) 344 : cluster [DBG] mgrmap e20: x(active, star

Tests start running, line 42136 - 2024-03-27T06:07:52.025 INFO:tasks.cephfs_test_runner:Starting test: test_cephfs_export_update_at_non_dir_path (tasks.cephfs.test_nfs.TestNFS)
The cluster is healthy again near the end of test_nfs.py, from line 231178 -
2024-03-27T06:32:18.776 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:32:18 smithi184 bash[21504]: cluster 2024-03-27T06:32:17.531023+0000 mon.a (mon.0) 3332 : cluster [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem is degraded)
2024-03-27T06:32:18.776 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:32:18 smithi184 bash[21504]: cluster 2024-03-27T06:32:17.531067+0000 mon.a (mon.0) 3333 : cluster [INF] Cluster is now healthy

Tests finish running, line 247158 - 2024-03-27T06:37:26.372 INFO:tasks.cephfs_test_runner:Ran 30 tests in 1796.992s

Warning 3, line 247231 - 2024-03-27T06:38:24.458 INFO:teuthology.orchestra.run.smithi184.stdout:2024-03-27T06:07:34.219228+0000 mon.a (mon.0) 323 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
Warning 4, line 247236 - 2024-03-27T06:38:24.673 INFO:teuthology.orchestra.run.smithi184.stdout:2024-03-27T06:07:34.219228+0000 mon.a (mon.0) 323 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)

From /a/rishabh-2024-03-27_05:27:11-fs-wip-rishabh-testing-20240326.131558-testing-default-smithi/7625569/remote/smithi184/log/2d1fee3e-ebff-11ee-95d0-87774f69a715/ceph-mgr.x.log.gz -

2024-03-27T06:00:40.207+0000 7f426b494200  0 ceph version 19.0.0-2478-g155268c4 (155268c4e432a12433aa833f174f9fe3b1016ae0) squid (dev), process ceph-mgr, pid 7
2024-03-27T06:00:55.715+0000 7f303e506200  0 ceph version 19.0.0-2478-g155268c4 (155268c4e432a12433aa833f174f9fe3b1016ae0) squid (dev), process ceph-mgr, pid 7
2024-03-27T06:01:29.896+0000 7f8248a0c200  0 ceph version 19.0.0-2478-g155268c4 (155268c4e432a12433aa833f174f9fe3b1016ae0) squid (dev), process ceph-mgr, pid 7
2024-03-27T06:07:39.269+0000 7fbe2299a200  0 ceph version 19.0.0-2478-g155268c4 (155268c4e432a12433aa833f174f9fe3b1016ae0) squid (dev), process ceph-mgr, pid 7
2024-03-27T06:07:56.965+0000 7f09f5cd8200  0 ceph version 19.0.0-2478-g155268c4 (155268c4e432a12433aa833f174f9fe3b1016ae0) squid (dev), process ceph-mgr, pid 7
2024-03-27T06:18:30.951+0000 7f61b33d3200  0 ceph version 19.0.0-2478-g155268c4 (155268c4e432a12433aa833f174f9fe3b1016ae0) squid (dev), process ceph-mgr, pid 7
2024-03-27T06:18:40.251+0000 7f39b1c35200  0 ceph version 19.0.0-2478-g155268c4 (155268c4e432a12433aa833f174f9fe3b1016ae0) squid (dev), process ceph-mgr, pid 7

None of the PRs in the testing batch looks related to this. In fact, this doesn't look related to CephFS; Venky confirmed the same.


Related issues 4 (0 open, 4 closed)

Related to CephFS - Bug #65021: qa/suites/fs/nfs: cluster [WRN] Health check failed: 1 stray daemon(s) not managed by cephadm (CEPHADM_STRAY_DAEMON)" in cluster log - Duplicate - Dhairya Parmar

Copied to CephFS - Backport #66060: squid: qa: health warning "no active mgr (MGR_DOWN)" occurs before and after test_nfs runs - Resolved - Dhairya Parmar
Copied to CephFS - Backport #66061: reef: qa: health warning "no active mgr (MGR_DOWN)" occurs before and after test_nfs runs - Resolved - Dhairya Parmar
Copied to CephFS - Backport #66062: quincy: qa: health warning "no active mgr (MGR_DOWN)" occurs before and after test_nfs runs - Resolved - Dhairya Parmar
Actions #1

Updated by Rishabh Dave almost 2 years ago

  • Project changed from Ceph to CephFS
  • Labels (FS) qa-failure added
Actions #2

Updated by Laura Flores almost 2 years ago

Looks like the MGR went down because of:

/a/rishabh-2024-03-27_05:27:11-fs-wip-rishabh-testing-20240326.131558-testing-default-smithi/7625569/remote/smithi184/log/2d1fee3e-ebff-11ee-95d0-87774f69a715/ceph-mgr.x.log.gz

2024-03-27T06:08:46.804+0000 7f0970a00700  0 [nfs ERROR nfs.export] Failed to apply export: path /testfile is not a dir
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/nfs/export.py", line 76, in validate_cephfs_path
    cephfs_path_is_dir(mgr, fs_name, path)
  File "/usr/share/ceph/mgr/nfs/utils.py", line 104, in cephfs_path_is_dir
    raise NotADirectoryError()
NotADirectoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/nfs/export.py", line 581, in _change_export
    return self._apply_export(cluster_id, export)
  File "/usr/share/ceph/mgr/nfs/export.py", line 830, in _apply_export
    new_export_dict
  File "/usr/share/ceph/mgr/nfs/export.py", line 689, in create_export_from_dict
    validate_cephfs_path(self.mgr, fs_name, path)
  File "/usr/share/ceph/mgr/nfs/export.py", line 78, in validate_cephfs_path
    raise NFSException(f"path {path} is not a dir", -errno.ENOTDIR)
nfs.exception.NFSException: path /testfile is not a dir

Actions #3

Updated by Venky Shankar almost 2 years ago

  • Category set to Correctness/Safety
  • Assignee set to Dhairya Parmar
  • Priority changed from Normal to Urgent
  • Target version set to v20.0.0
  • Backport set to quincy,reef,squid
  • Component(FS) mgr/nfs added
  • Labels (FS) deleted (qa-failure)

Thanks for taking a look, Laura.

Dhairya, please take this one. AFAICT, this exception should have been handled in mgr/nfs and an errno should have been returned to the caller.

Actions #4

Updated by Venky Shankar almost 2 years ago

  • Related to Bug #65021: qa/suites/fs/nfs: cluster [WRN] Health check failed: 1 stray daemon(s) not managed by cephadm (CEPHADM_STRAY_DAEMON)" in cluster log added
Actions #5

Updated by Dhairya Parmar almost 2 years ago

how are we hitting this now? This code has existed for quite some time and has always worked fine

Actions #6

Updated by Dhairya Parmar almost 2 years ago

`validate_cephfs_path()` calls `cephfs_path_is_dir()` for every path; if the path is not a dir, it raises `NotADirectoryError()`, so `validate_cephfs_path()` should catch it in the first `except` block and raise `NFSException`. I'm not sure how this isn't working:

def validate_cephfs_path(mgr: 'Module', fs_name: str, path: str) -> None:
    try:
        cephfs_path_is_dir(mgr, fs_name, path)
    except NotADirectoryError:
        raise NFSException(f"path {path} is not a dir", -errno.ENOTDIR)
    except cephfs.ObjectNotFound:
        raise NFSObjectNotFound(f"path {path} does not exist")
    except cephfs.Error as e:
        raise NFSException(e.args[1], -e.args[0])

def cephfs_path_is_dir(mgr: 'Module', fs: str, path: str) -> None:
    @functools.lru_cache(maxsize=1)
    def _get_cephfs_client() -> CephfsClient:
        return CephfsClient(mgr)
    cephfs_client = _get_cephfs_client()

    with open_filesystem(cephfs_client, fs) as fs_handle:
        stx = fs_handle.statx(path.encode('utf-8'), cephfs.CEPH_STATX_MODE,
                              cephfs.AT_SYMLINK_NOFOLLOW)
        if not stat.S_ISDIR(stx.get('mode')):
            raise NotADirectoryError()
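The "During handling of the above exception" traceback Laura quoted is exactly what Python emits when a handled exception is re-raised as a new one inside an `except` block: the original is kept as `__context__`, and formatting the new exception prints both chained tracebacks. A minimal, self-contained sketch of that behavior (the class and function names below are illustrative stand-ins, not the real mgr/nfs module):

```python
import traceback

class NFSException(Exception):
    pass

def cephfs_path_is_dir(path):
    # Stand-in for the real statx-based directory check.
    raise NotADirectoryError()

def validate_cephfs_path(path):
    try:
        cephfs_path_is_dir(path)
    except NotADirectoryError:
        # Re-raising inside `except` implicitly chains the original
        # NotADirectoryError as __context__ of the new exception.
        raise NFSException(f"path {path} is not a dir")

try:
    validate_cephfs_path("/testfile")
except NFSException:
    # Logging the handled exception still prints BOTH tracebacks, joined
    # by "During handling of the above exception, another exception
    # occurred:" - i.e. the mgr log entry does not imply a crash.
    tb = traceback.format_exc()
    assert "During handling of the above exception" in tb
```

So a chained traceback in the mgr log is consistent with the exception being handled further up; it is the log formatter, not an unhandled failure.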
Actions #7

Updated by Dhairya Parmar almost 2 years ago

Venky Shankar wrote in #note-3:

Thanks for taking a look, Laura.

Dhairya, please take this one. AFAICT, this exception should have been handled in mgr/nfs and an errno should have been returned to the caller.

it does exactly this; `raise NFSException(f"path {path} is not a dir", -errno.ENOTDIR)` is what should send the errno and error string to the CLI

Actions #8

Updated by Venky Shankar almost 2 years ago

NotADirectoryError is probably not a valid (built-in) exception in some Python versions. My question is: if this exception is getting handled, then why is it showing up in the mgr log?

Actions #9

Updated by Venky Shankar almost 2 years ago

Dhairya mentioned that the tracebacks seen in the mgr logs are logged by the object formatter and are not necessarily unhandled exceptions. This means those tracebacks aren't really the underlying cause of the MGR_DOWN warning.

Actions #10

Updated by Dhairya Parmar almost 2 years ago

I was confident in the code; I'd mentioned this in https://tracker.ceph.com/issues/65265#note-6. I then raised a PR trying something out, but the job failed [0], and this time the failure was

"2024-04-10T07:01:11.813042+0000 mon.a (mon.0) 610 : cluster [WRN] Health check failed: 1 stray daemon(s) not managed by cephadm (CEPHADM_STRAY_DAEMON)" in cluster log

I then probed the entire mgr log and found 44 tracebacks (including the one mentioned by Laura). Those are just logs of exceptions raised intentionally, since test_nfs consists of negative tests where invalid data is fed in to make sure several edge cases are handled as intended. I'd have been surprised if those 'raised exception' logs weren't present in the mgr log.

I don't think this is an issue. I'm trying to investigate why there was no active MGR; maybe something is going wrong during upgrades.
[0] https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

Actions #11

Updated by Venky Shankar almost 2 years ago

Dhairya Parmar wrote in #note-10:

I was confident in the code; I'd mentioned this in https://tracker.ceph.com/issues/65265#note-6. I then raised a PR trying something out, but the job failed [0], and this time the failure was
[...]

I then probed the entire mgr log and found 44 tracebacks (including the one mentioned by Laura). Those are just logs of exceptions raised intentionally, since test_nfs consists of negative tests where invalid data is fed in to make sure several edge cases are handled as intended. I'd have been surprised if those 'raised exception' logs weren't present in the mgr log.

I don't think this is an issue. I'm trying to investigate why there was no active MGR; maybe something is going wrong during upgrades.
[0] https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

That job has a single ceph-mgr daemon configured. test_exports_on_mgr_restart will fail the mgr for a jiffy - that might be causing the warning.

Actions #12

Updated by Dhairya Parmar almost 2 years ago

this doesn't seem related to test cases at all

time when the MGR_DOWN warning was seen:

2024-03-27T06:07:34.833 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:34 smithi184 bash[21504]: cluster 2024-03-27T06:07:34.219228+0000 mon.a (mon.0) 323 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)

time when first test case ran:

2024-03-27T06:07:52.025 INFO:tasks.cephfs_test_runner:Starting test: test_cephfs_export_update_at_non_dir_path (tasks.cephfs.test_nfs.TestNFS)

failure reason too mentioned the same timestamp:

failure_reason: '"2024-03-27T06:07:34.219228+0000 mon.a (mon.0) 323 : cluster [WRN]
  Health check failed: no active mgr (MGR_DOWN)" in cluster log

Actions #13

Updated by Venky Shankar almost 2 years ago

Venky Shankar wrote in #note-11:

Dhairya Parmar wrote in #note-10:

I was confident in the code; I'd mentioned this in https://tracker.ceph.com/issues/65265#note-6. I then raised a PR trying something out, but the job failed [0], and this time the failure was
[...]

I then probed the entire mgr log and found 44 tracebacks (including the one mentioned by Laura). Those are just logs of exceptions raised intentionally, since test_nfs consists of negative tests where invalid data is fed in to make sure several edge cases are handled as intended. I'd have been surprised if those 'raised exception' logs weren't present in the mgr log.

I don't think this is an issue. I'm trying to investigate why there was no active MGR; maybe something is going wrong during upgrades.
[0] https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

That job has a single ceph-mgr daemon configured. test_exports_on_mgr_restart will fail the mgr for a jiffy - that might be causing the warning.

Needs to be ignore-listed then - possibly a fallout from the recent clog changes :/
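In teuthology, ignore-listing a transient health warning is typically done through the suite's yaml overrides. A hedged sketch of what such a fragment could look like for the fs:nfs suite (the exact file and entry form in qa/suites are assumptions, modeled on the common `log-ignorelist` convention):

```yaml
# Hypothetical suite-yaml fragment: tell the cluster-log scraper to
# tolerate the transient warning raised while the lone mgr is failed over.
overrides:
  ceph:
    log-ignorelist:
      - MGR_DOWN
      - \(MGR_DOWN\)
```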

Actions #14

Updated by Dhairya Parmar almost 2 years ago

Venky Shankar wrote in #note-13:

Venky Shankar wrote in #note-11:

Dhairya Parmar wrote in #note-10:

I was confident in the code; I'd mentioned this in https://tracker.ceph.com/issues/65265#note-6. I then raised a PR trying something out, but the job failed [0], and this time the failure was
[...]

I then probed the entire mgr log and found 44 tracebacks (including the one mentioned by Laura). Those are just logs of exceptions raised intentionally, since test_nfs consists of negative tests where invalid data is fed in to make sure several edge cases are handled as intended. I'd have been surprised if those 'raised exception' logs weren't present in the mgr log.

I don't think this is an issue. I'm trying to investigate why there was no active MGR; maybe something is going wrong during upgrades.
[0] https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

That job has a single ceph-mgr daemon configured. test_exports_on_mgr_restart will fail the mgr for a jiffy - that might be causing the warning.

Needs to be ignore-listed then - possibly a fallout from the recent clog changes :/

Greg suggested we should go with 2 MGRs instead of 1; what do you think about this?

Actions #15

Updated by Venky Shankar almost 2 years ago

Dhairya Parmar wrote in #note-14:

Venky Shankar wrote in #note-13:

Venky Shankar wrote in #note-11:

Dhairya Parmar wrote in #note-10:

I was confident in the code; I'd mentioned this in https://tracker.ceph.com/issues/65265#note-6. I then raised a PR trying something out, but the job failed [0], and this time the failure was
[...]

I then probed the entire mgr log and found 44 tracebacks (including the one mentioned by Laura). Those are just logs of exceptions raised intentionally, since test_nfs consists of negative tests where invalid data is fed in to make sure several edge cases are handled as intended. I'd have been surprised if those 'raised exception' logs weren't present in the mgr log.

I don't think this is an issue. I'm trying to investigate why there was no active MGR; maybe something is going wrong during upgrades.
[0] https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

That job has a single ceph-mgr daemon configured. test_exports_on_mgr_restart will fail the mgr for a jiffy - that might be causing the warning.

Needs to be ignore-listed then - possibly a fallout from the recent clog changes :/

Greg suggested we should go with 2 MGRs instead of 1; what do you think about this?

That works too +1

Actions #16

Updated by Dhairya Parmar almost 2 years ago

Venky Shankar wrote in #note-15:

Dhairya Parmar wrote in #note-14:

Venky Shankar wrote in #note-13:

Venky Shankar wrote in #note-11:

Dhairya Parmar wrote in #note-10:

I was confident in the code; I'd mentioned this in https://tracker.ceph.com/issues/65265#note-6. I then raised a PR trying something out, but the job failed [0], and this time the failure was
[...]

I then probed the entire mgr log and found 44 tracebacks (including the one mentioned by Laura). Those are just logs of exceptions raised intentionally, since test_nfs consists of negative tests where invalid data is fed in to make sure several edge cases are handled as intended. I'd have been surprised if those 'raised exception' logs weren't present in the mgr log.

I don't think this is an issue. I'm trying to investigate why there was no active MGR; maybe something is going wrong during upgrades.
[0] https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

That job has a single ceph-mgr daemon configured. test_exports_on_mgr_restart will fail the mgr for a jiffy - that might be causing the warning.

Needs to be ignore-listed then - possibly a fallout from the recent clog changes :/

Greg suggested we should go with 2 MGRs instead of 1; what do you think about this?

That works too +1

sure, thanks

Actions #19

Updated by Patrick Donnelly almost 2 years ago

Dhairya Parmar wrote in #note-18:

I ran a couple of NFS jobs, no `MGR_DOWN` reported

https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

https://pulpito.ceph.com/dparmar-2024-04-25_07:39:13-fs:nfs-dparmar-24-apr-main-distro-default-smithi/

https://pulpito.ceph.com/dparmar-2024-04-25_09:10:36-fs:nfs-dparmar-24-apr-main-distro-default-smithi/

This warning is generated when the mgr somehow crashed, leaving no mgr available, and `mgr fail` is run.

What code is running `mgr fail` and why is the `fs` suite unaffected?

Actions #20

Updated by Dhairya Parmar almost 2 years ago · Edited

Patrick Donnelly wrote in #note-19:

Dhairya Parmar wrote in #note-18:

I ran a couple of NFS jobs, no `MGR_DOWN` reported

https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

https://pulpito.ceph.com/dparmar-2024-04-25_07:39:13-fs:nfs-dparmar-24-apr-main-distro-default-smithi/

https://pulpito.ceph.com/dparmar-2024-04-25_09:10:36-fs:nfs-dparmar-24-apr-main-distro-default-smithi/

This warning is generated when the mgr somehow crashed, leaving no mgr available, and `mgr fail` is run.

What code is running `mgr fail` and why is the `fs` suite unaffected?

The `fs` suite is unaffected because the code is run by `qa/tasks/mgr/mgr_test_case.py` (https://github.com/ceph/ceph/blob/befd8dce33758178d3b108219d73b7710f68b133/qa/tasks/mgr/mgr_test_case.py#L78-L86), and the reason this is prevalent only in the `fs:nfs` suite is that it is the only fs suite that uses the `MgrTestCase` class

Actions #21

Updated by Dhairya Parmar almost 2 years ago

Dhairya Parmar wrote in #note-20:

Patrick Donnelly wrote in #note-19:

Dhairya Parmar wrote in #note-18:

I ran a couple of NFS jobs, no `MGR_DOWN` reported

https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

https://pulpito.ceph.com/dparmar-2024-04-25_07:39:13-fs:nfs-dparmar-24-apr-main-distro-default-smithi/

https://pulpito.ceph.com/dparmar-2024-04-25_09:10:36-fs:nfs-dparmar-24-apr-main-distro-default-smithi/

This warning is generated when the mgr somehow crashed, leaving no mgr available, and `mgr fail` is run.

What code is running `mgr fail` and why is the `fs` suite unaffected?

The `fs` suite is unaffected because the code is run by `qa/tasks/mgr/mgr_test_case.py` (https://github.com/ceph/ceph/blob/befd8dce33758178d3b108219d73b7710f68b133/qa/tasks/mgr/mgr_test_case.py#L78-L86), and the reason this is prevalent only in the `fs:nfs` suite is that it is the only fs suite that uses the `MgrTestCase` class

The solution I can think of is to check whether a mgr exists before running `cls.mgr_cluster.mgr_fail(mgr_id)` in the above code snippet.
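The guard proposed above can be sketched with a toy stand-in for the test-case cluster object; `MgrCluster` and its attributes here are hypothetical, not the real teuthology API (and note the fix ultimately merged was to ignore-list the warning instead, per the later comments):

```python
# Hypothetical, self-contained sketch of "check a mgr exists before
# calling `mgr fail`". MgrCluster is a toy model of the test harness.
class MgrCluster:
    def __init__(self, mgr_ids, active=None):
        self.mgr_ids = mgr_ids
        self.active = active      # None models "no active mgr"
        self.failed = []

    def mgr_fail(self, mgr_id):
        self.failed.append(mgr_id)

def setup_mgrs(cluster):
    # Guard: skip `mgr fail` when there is no active mgr, so the setup
    # path cannot itself trigger the MGR_DOWN health warning that the
    # log scraper later flags.
    if cluster.active is None:
        return
    for mgr_id in cluster.mgr_ids:
        cluster.mgr_fail(mgr_id)

cluster = MgrCluster(["x"], active=None)
setup_mgrs(cluster)
assert cluster.failed == []       # nothing failed: no MGR_DOWN raised
```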

Actions #22

Updated by Patrick Donnelly almost 2 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 56944

Dhairya Parmar wrote in #note-20:

Patrick Donnelly wrote in #note-19:

Dhairya Parmar wrote in #note-18:

I ran a couple of NFS jobs, no `MGR_DOWN` reported

https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

https://pulpito.ceph.com/dparmar-2024-04-25_07:39:13-fs:nfs-dparmar-24-apr-main-distro-default-smithi/

https://pulpito.ceph.com/dparmar-2024-04-25_09:10:36-fs:nfs-dparmar-24-apr-main-distro-default-smithi/

This warning is generated when the mgr somehow crashed, leaving no mgr available, and `mgr fail` is run.

What code is running `mgr fail` and why is the `fs` suite unaffected?

The `fs` suite is unaffected because the code is run by `qa/tasks/mgr/mgr_test_case.py` (https://github.com/ceph/ceph/blob/befd8dce33758178d3b108219d73b7710f68b133/qa/tasks/mgr/mgr_test_case.py#L78-L86), and the reason this is prevalent only in the `fs:nfs` suite is that it is the only fs suite that uses the `MgrTestCase` class

Yes! Does that change your "root cause analysis" in your PR/commit message?

Actions #23

Updated by Patrick Donnelly almost 2 years ago

Dhairya Parmar wrote in #note-21:

Dhairya Parmar wrote in #note-20:

Patrick Donnelly wrote in #note-19:

Dhairya Parmar wrote in #note-18:

I ran a couple of NFS jobs, no `MGR_DOWN` reported

https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

https://pulpito.ceph.com/dparmar-2024-04-25_07:39:13-fs:nfs-dparmar-24-apr-main-distro-default-smithi/

https://pulpito.ceph.com/dparmar-2024-04-25_09:10:36-fs:nfs-dparmar-24-apr-main-distro-default-smithi/

This warning is generated when the mgr somehow crashed, leaving no mgr available, and `mgr fail` is run.

What code is running `mgr fail` and why is the `fs` suite unaffected?

The `fs` suite is unaffected because the code is run by `qa/tasks/mgr/mgr_test_case.py` (https://github.com/ceph/ceph/blob/befd8dce33758178d3b108219d73b7710f68b133/qa/tasks/mgr/mgr_test_case.py#L78-L86), and the reason this is prevalent only in the `fs:nfs` suite is that it is the only fs suite that uses the `MgrTestCase` class

The solution I can think of is to check whether a mgr exists before running `cls.mgr_cluster.mgr_fail(mgr_id)` in the above code snippet.

Ignoring the warning is correct. I want you to clean up your analysis in the commit/PR.

Actions #24

Updated by Dhairya Parmar almost 2 years ago

Patrick Donnelly wrote in #note-23:

Dhairya Parmar wrote in #note-21:

Dhairya Parmar wrote in #note-20:

Patrick Donnelly wrote in #note-19:

Dhairya Parmar wrote in #note-18:

I ran a couple of NFS jobs, no `MGR_DOWN` reported

https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

https://pulpito.ceph.com/dparmar-2024-04-25_07:39:13-fs:nfs-dparmar-24-apr-main-distro-default-smithi/

https://pulpito.ceph.com/dparmar-2024-04-25_09:10:36-fs:nfs-dparmar-24-apr-main-distro-default-smithi/

This warning is generated when the mgr somehow crashed, leaving no mgr available, and `mgr fail` is run.

What code is running `mgr fail` and why is the `fs` suite unaffected?

The `fs` suite is unaffected because the code is run by `qa/tasks/mgr/mgr_test_case.py` (https://github.com/ceph/ceph/blob/befd8dce33758178d3b108219d73b7710f68b133/qa/tasks/mgr/mgr_test_case.py#L78-L86), and the reason this is prevalent only in the `fs:nfs` suite is that it is the only fs suite that uses the `MgrTestCase` class

The solution I can think of is to check whether a mgr exists before running `cls.mgr_cluster.mgr_fail(mgr_id)` in the above code snippet.

Ignoring the warning is correct. I want you to clean up your analysis in the commit/PR.

done

Actions #26

Updated by Venky Shankar almost 2 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #27

Updated by Upkeep Bot almost 2 years ago

  • Copied to Backport #66060: squid: qa: health warning "no active mgr (MGR_DOWN)" occurs before and after test_nfs runs added
Actions #28

Updated by Upkeep Bot almost 2 years ago

  • Copied to Backport #66061: reef: qa: health warning "no active mgr (MGR_DOWN)" occurs before and after test_nfs runs added
Actions #29

Updated by Upkeep Bot almost 2 years ago

  • Copied to Backport #66062: quincy: qa: health warning "no active mgr (MGR_DOWN)" occurs before and after test_nfs runs added
Actions #31

Updated by Dhairya Parmar about 1 year ago

  • Status changed from Pending Backport to Resolved

backports done

Actions #32

Updated by Upkeep Bot 9 months ago

  • Merge Commit set to 999ca78a1ab51d83606f7dab6b9b0a63326191fb
  • Fixed In set to v19.3.0-2212-g999ca78a1ab5
  • Upkeep Timestamp set to 2025-07-02T03:16:57+00:00
Actions #33

Updated by Upkeep Bot 9 months ago

  • Fixed In changed from v19.3.0-2212-g999ca78a1ab5 to v19.3.0-2212-g999ca78a1ab
  • Upkeep Timestamp changed from 2025-07-02T03:16:57+00:00 to 2025-07-08T18:30:03+00:00
Actions #34

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-2212-g999ca78a1ab to v19.3.0-2212-g999ca78a1a
  • Upkeep Timestamp changed from 2025-07-08T18:30:03+00:00 to 2025-07-14T17:10:33+00:00
Actions #35

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~2885
  • Upkeep Timestamp changed from 2025-07-14T17:10:33+00:00 to 2025-11-01T01:12:49+00:00