
qa: add a YAML to ignore MGR_DOWN warning#56944

Merged
vshankar merged 1 commit into ceph:main from dparmar18:fix-mgr-down-warning-nfs-suite
May 16, 2024

Conversation


@dparmar18 dparmar18 commented Apr 17, 2024

RCA showed that it is not the NFS code that led to the warning, since the warning occurred before the test cases started to execute. Later, after some discussion with Venky and Greg, it was found that some recent clog changes lead to this warning being added to the clog.

Digging further, it was found that the warning is generated when mgr fail is run while there is no mgr available. The mgr is unavailable because when setup_mgrs() in class MgrTestCase stops the mgr daemons, sometimes the mgr just crashes - mgr handle_mgr_signal *** Got signal Terminated *** - after which mgr fail (also part of setup_mgrs()) is run and the MGR_DOWN warning is generated.

This warning is only evident in nfs because it is the only fs suite that makes use of class MgrTestCase. To support my analysis, I ran about eight jobs in teuthology and could not reproduce this warning. Since this is not harming the NFS test case execution, and the logs do mention that the mgr daemon did get restarted (INFO:tasks.cephadm.mgr.x:Restarting mgr.x (starting--it wasn't running)...), it is reasonable to conclude that ignoring this warning is the simplest solution.
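To make the sequence concrete, here is a minimal sketch of the failure mode (plain Python, not the actual teuthology code; FakeCluster and its methods are invented for illustration):

```python
# Illustrative simulation: setup_mgrs() stops every mgr daemon, then runs
# `mgr fail`; with no active mgr and no standby left, the mon raises MGR_DOWN.

class FakeCluster:
    def __init__(self, mgrs):
        self.active = mgrs[0]           # first mgr becomes active
        self.standbys = list(mgrs[1:])  # the rest are standbys
        self.health_warnings = []

    def stop_all_mgrs(self):
        # setup_mgrs() stops every mgr daemon before restarting them,
        # so at this point there is neither an active mgr nor a standby.
        self.active = None
        self.standbys = []

    def mgr_fail(self):
        # `ceph mgr fail` marks the active mgr failed; a standby takes
        # over if one exists, otherwise the mon raises MGR_DOWN.
        self.active = None
        if self.standbys:
            self.active = self.standbys.pop(0)
        else:
            self.health_warnings.append("MGR_DOWN")

cluster = FakeCluster(["x"])
cluster.stop_all_mgrs()
cluster.mgr_fail()
print(cluster.health_warnings)  # ['MGR_DOWN']

# Adding a second mgr does not help, because setup_mgrs() stops them all
# before running `mgr fail`.
cluster2 = FakeCluster(["x", "y"])
cluster2.stop_all_mgrs()
cluster2.mgr_fail()
print(cluster2.health_warnings)  # ['MGR_DOWN']
```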

Fixes: https://tracker.ceph.com/issues/65265
Signed-off-by: Dhairya Parmar dparmar@redhat.com

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@github-actions github-actions bot added the cephfs Ceph File System label Apr 17, 2024
@dparmar18 dparmar18 force-pushed the fix-mgr-down-warning-nfs-suite branch from 6268392 to b3a32d9 Compare April 17, 2024 09:29
@dparmar18 dparmar18 requested a review from a team April 17, 2024 12:27

dparmar18 commented Apr 18, 2024

I still see the MGR_DOWN warning even after adding one more MGR:
https://pulpito.ceph.com/dparmar-2024-04-18_09:23:37-fs:nfs-wip-65265-distro-default-smithi/

@dparmar18

2024-04-18T09:50:07.439 INFO:journalctl@ceph.mon.a.smithi121.stdout:Apr 18 09:50:07 smithi121 bash[21537]: cluster 2024-04-18T09:50:06.855313+0000 mon.a (mon.0) 358 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)

still evident even after adding one more MGR

https://pulpito.ceph.com/dparmar-2024-04-18_09:23:37-fs:nfs-wip-65265-distro-default-smithi/

@dparmar18

I guess adding this warning to the ignorelist is the only option.

@dparmar18

So `ignorelist_health.yaml` -> `.qa/cephfs/overrides/ignorelist_health.yaml` is a cephfs suite-wide YAML; I hope it is fine to add the warning there @vshankar @gregsfortytwo
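For reference, such an override follows the usual teuthology shape, roughly like the sketch below (illustrative only; the exact path and entry wording are whatever the commit in this PR adds, not necessarily this):

```yaml
# illustrative sketch of a teuthology override that tells the log scraper
# to ignore the MGR_DOWN health warning for suites that include this YAML
overrides:
  ceph:
    log-ignorelist:
      - MGR_DOWN
```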

@dparmar18 dparmar18 force-pushed the fix-mgr-down-warning-nfs-suite branch from b3a32d9 to 16695f1 Compare April 18, 2024 14:38
@dparmar18 dparmar18 changed the title qa: fix "no active mgr (MGR_DOWN)" in fs:nfs suite qa: add a YAML to ignore MGR_DOWN warning Apr 18, 2024
@dparmar18 dparmar18 force-pushed the fix-mgr-down-warning-nfs-suite branch from 16695f1 to 3367702 Compare April 18, 2024 14:50

@batrick batrick left a comment


RCA showed that it is not the NFS code that lead to the warning since
the warning occurred before the test cases started to execute, later on
after some discussion with the venky and greg, it was found that there
were some clog changes made recently which leads to this warning being
added to the clog.

Yes, the clog checks are failing the job. But why is the mgr getting restarted (and how) to cause MGR_DOWN? That is the root cause that needs to be identified.

@dparmar18

Okay, let's take two runs as examples:

  1. https://pulpito.ceph.com/rishabh-2024-03-27_05:27:11-fs-wip-rishabh-testing-20240326.131558-testing-default-smithi/7625569/
  2. https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

Both run with exact same YAML matrix except the distro:

  1. rishabh's run, where the MGR_DOWN warning was seen:
fs/nfs/{cluster/{1-node} conf/{client mds mgr mon osd} overrides/{ignorelist_health pg_health} supported-random-distros$/{ubuntu_latest} tasks/nfs}
  2. mine, where the run failed, but the failure is irrelevant since no MGR_DOWN warning was seen:
fs:nfs/{cluster/{1-node} conf/{client mds mgr mon osd} overrides/{ignorelist_health pg_health} supported-random-distros$/{centos_latest} tasks/nfs}

In 1), there is a command to stop the mgr:

2024-03-27T06:07:29.758 INFO:tasks.cephadm.mgr.x:Stopping mgr.x...
2024-03-27T06:07:29.758 DEBUG:teuthology.orchestra.run.smithi184:> sudo systemctl stop ceph-2d1fee3e-ebff-11ee-95d0-87774f69a715@mgr.x
2024-03-27T06:07:30.028 INFO:journalctl@ceph.osd.0.smithi184.stdout:Mar 27 06:07:29 smithi184 systemd[1]: /etc/systemd/system/ceph-2d1fee3e-ebff-11ee-95d0-87774f69a715@.service:23: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.

then it somehow crashed:

2024-03-27T06:07:30.143 INFO:journalctl@ceph.mgr.x.smithi184.stdout:Mar 27 06:07:30 smithi184 systemd[1]: Stopping Ceph mgr.x for 2d1fee3e-ebff-11ee-95d0-87774f69a715...
2024-03-27T06:07:30.777 INFO:journalctl@ceph.mgr.x.smithi184.stdout:Mar 27 06:07:30 smithi184 bash[21756]: debug 2024-03-27T06:07:30.729+0000 7f823c400700 -1 received  signal: Terminated from /sbin/docker-init -- /usr/bin/ceph-mgr -n mgr.x -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug   (PID: 1) UID: 0
2024-03-27T06:07:30.778 INFO:journalctl@ceph.mgr.x.smithi184.stdout:Mar 27 06:07:30 smithi184 bash[21756]: debug 2024-03-27T06:07:30.729+0000 7f823c400700 -1 mgr handle_mgr_signal  *** Got signal Terminated ***

and when `mgr fail` was run, it generated the warning:

2024-03-27T06:07:34.833 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:34 smithi184 bash[21504]: audit 2024-03-27T06:07:34.218831+0000 mon.a (mon.0) 322 : audit [INF] from='client.? 172.21.15.184:0/2918680897' entity='client.admin' cmd=[{"prefix": "mgr fail", "who": "x"}]: dispatch
2024-03-27T06:07:34.833 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:34 smithi184 bash[21504]: cluster 2024-03-27T06:07:34.219228+0000 mon.a (mon.0) 323 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
2024-03-27T06:07:34.833 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:34 smithi184 bash[21504]: audit 2024-03-27T06:07:34.221017+0000 mon.a (mon.0) 324 : audit [INF] from='client.? 172.21.15.184:0/2918680897' entity='client.admin' cmd='[{"prefix": "mgr fail", "who": "x"}]': finished

i.e. the MGR_DOWN warning is generated when the `mgr fail` command is run while there is no active mgr.


Now if we look at 2):

2024-04-10T06:59:57.131 INFO:tasks.cephadm.mgr.x:Stopping mgr.x...
2024-04-10T06:59:57.131 DEBUG:teuthology.orchestra.run.smithi031:> sudo systemctl stop ceph-57e90e8a-f707-11ee-b64a-cb9ed24678a4@mgr.x
2024-04-10T06:59:57.415 INFO:journalctl@ceph.mgr.x.smithi031.stdout:Apr 10 06:59:57 smithi031 systemd[1]: Stopping Ceph mgr.x for 57e90e8a-f707-11ee-b64a-cb9ed24678a4...
2024-04-10T06:59:58.443 INFO:journalctl@ceph.mgr.x.smithi031.stdout:Apr 10 06:59:58 smithi031 systemd[1]: ceph-57e90e8a-f707-11ee-b64a-cb9ed24678a4@mgr.x.service: Deactivated successfully.
2024-04-10T06:59:58.443 INFO:journalctl@ceph.mgr.x.smithi031.stdout:Apr 10 06:59:58 smithi031 systemd[1]: Stopped Ceph mgr.x for 57e90e8a-f707-11ee-b64a-cb9ed24678a4.

(I don't see the above lines in (1)'s teuthology logs.)

2024-04-10T06:59:58.443 INFO:journalctl@ceph.mgr.x.smithi031.stdout:Apr 10 06:59:58 smithi031 systemd[1]: ceph-57e90e8a-f707-11ee-b64a-cb9ed24678a4@mgr.x.service: Consumed 33.490s CPU time.
2024-04-10T06:59:58.444 DEBUG:teuthology.orchestra.run.smithi031:> sudo pkill -f 'journalctl -f -n 0 -u ceph-57e90e8a-f707-11ee-b64a-cb9ed24678a4@mgr.x.service'

and the `mgr fail` succeeds:

2024-04-10T06:59:59.962 INFO:teuthology.orchestra.run.smithi031.stderr:2024-04-10T06:59:59.965+0000 7f7a11ddc640  1 -- 172.21.15.31:0/1174769028 --> [v2:172.21.15.31:3300/0,v1:172.21.15.31:6789/0] -- mon_command({"prefix": "mgr fail", "who": "x"} v 0) -- 0x7f7a0c02d5e0 con 0x7f7a0c102590
2024-04-10T06:59:59.964 INFO:teuthology.orchestra.run.smithi031.stderr:2024-04-10T06:59:59.967+0000 7f7a097fa640  1 -- 172.21.15.31:0/1174769028 <== mon.0 v2:172.21.15.31:3300/0 7 ==== mon_command_ack([{"prefix": "mgr fail", "who": "x"}]=0  v16) ==== 68+0+0 (secure 0 0 0) 0x7f79fc03cc10 con 0x7f7a0c102590

@dparmar18

#56944 (comment) can be reproduced easily:

  • I first start a cluster:
dparmar:build$ ./bin/ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2024-04-24T17:12:50.193+0530 7fb290bd06c0 -1 WARNING: all dangerous and experimental features are enabled.
2024-04-24T17:12:50.198+0530 7fb290bd06c0 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     8cadbeca-4f35-4ca9-b998-e72350d09fa4
    health: HEALTH_WARN
            Module 'rbd_support' has failed dependency: module 'rbd' has no attribute 'Image'
 
  services:
    mon: 3 daemons, quorum a,b,c (age 52s)
    mgr: x(active, since 47s)
    mds: 1/1 daemons up, 2 standby
    osd: 3 osds: 3 up (since 18s), 3 in (since 33s)
 
  data:
    volumes: 1/1 healthy
    pools:   3 pools, 3 pgs
    objects: 24 objects, 451 KiB
    usage:   3.0 GiB used, 300 GiB / 303 GiB avail
    pgs:     3 active+clean
 
  io:
    client:   1.7 KiB/s wr, 0 op/s rd, 5 op/s wr
  • verify no warnings exist
    dparmar:build$ grep -rnI 'MGR_DOWN' ./out/

  • kill mgr

dparmar:build$ pgrep ceph-mgr 
738866
dparmar:build$ kill -9 738866
  • wait a bit and check status again
dparmar:build$ ./bin/ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2024-04-24T17:16:44.528+0530 7f07a774f6c0 -1 WARNING: all dangerous and experimental features are enabled.
2024-04-24T17:16:44.533+0530 7f07a774f6c0 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     8cadbeca-4f35-4ca9-b998-e72350d09fa4
    health: HEALTH_WARN
            no active mgr
            Module 'rbd_support' has failed dependency: module 'rbd' has no attribute 'Image'
 
  services:
    mon: 3 daemons, quorum a,b,c (age 4m)
    mgr: no daemons active (since 107s)
    mds: 1/1 daemons up, 2 standby
    osd: 3 osds: 3 up (since 4m), 3 in (since 4m)
 
  data:
    volumes: 1/1 healthy
    pools:   3 pools, 145 pgs
    objects: 24 objects, 451 KiB
    usage:   3.0 GiB used, 300 GiB / 303 GiB avail
    pgs:     145 active+clean
  • run mgr fail
dparmar:build$ ./bin/ceph mgr fail
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2024-04-24T17:14:56.626+0530 7f08a1d7c6c0 -1 WARNING: all dangerous and experimental features are enabled.
2024-04-24T17:14:56.632+0530 7f08a1d7c6c0 -1 WARNING: all dangerous and experimental features are enabled.
  • grep for MGR_DOWN
dparmar:build$ grep -rnI 'MGR_DOWN' ./out/
./out/mon.a.log:98750:2024-04-24T17:14:56.728+0530 7f08a05f46c0  0 log_channel(cluster) log [WRN] : Health check failed: no active mgr (MGR_DOWN)
./out/mon.a.log:98788:2024-04-24T17:14:56.745+0530 7f08a05f46c0 10 mon.a@0(leader).log v128  logging 2024-04-24T17:14:56.729209+0530 mon.a (mon.0) 240 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
./out/mon.a.log:99057:2024-04-24T17:14:56.805+0530 7f089ddef6c0  7 mon.a@0(leader).log v129 update_from_paxos applying incremental log 129 2024-04-24T17:14:56.729209+0530 mon.a (mon.0) 240 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
./out/mon.b.log:38194:2024-04-24T17:14:56.814+0530 7fc3bf9f46c0  7 mon.b@1(peon).log v129 update_from_paxos applying incremental log 129 2024-04-24T17:14:56.729209+0530 mon.a (mon.0) 240 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
./out/mon.c.log:45422:2024-04-24T17:14:56.813+0530 7f74883f46c0  7 mon.c@2(peon).log v129 update_from_paxos applying incremental log 129 2024-04-24T17:14:56.729209+0530 mon.a (mon.0) 240 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
./out/cluster.mon.a.log:662:2024-04-24T17:14:56.729209+0530 mon.a (mon.0) 240 : cluster 3 Health check failed: no active mgr (MGR_DOWN)
./out/cluster.mon.b.log:662:2024-04-24T17:14:56.729209+0530 mon.a (mon.0) 240 : cluster 3 Health check failed: no active mgr (MGR_DOWN)
./out/cluster.mon.c.log:662:2024-04-24T17:14:56.729209+0530 mon.a (mon.0) 240 : cluster 3 Health check failed: no active mgr (MGR_DOWN)

@dparmar18

So the log below is from the run with the MGR_DOWN warning:

2024-04-24T07:57:24.232 INFO:tasks.cephadm.mgr.x:Stopping mgr.x...
2024-04-24T07:57:24.232 DEBUG:teuthology.orchestra.run.smithi179:> sudo systemctl stop ceph-7d875f26-020f-11ef-bc93-c7b262605968@mgr.x
2024-04-24T07:57:24.467 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:24 smithi179 ceph-mon[25686]: from='client.? 172.21.15.179:0/2524918627' entity='client.admin' cmd=[{"prefix": "osd pool get", "pool": ".mgr", "var": "pg_num"}]: dispatch
2024-04-24T07:57:24.468 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 07:57:24 smithi179 systemd[1]: Stopping Ceph mgr.x for 7d875f26-020f-11ef-bc93-c7b262605968...
2024-04-24T07:57:25.468 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: pgmap v117: 1 pgs: 1 active+clean; 577 KiB data, 82 MiB used, 268 GiB / 268 GiB avail
2024-04-24T07:57:25.468 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: from='client.? 172.21.15.179:0/684086560' entity='client.admin'
2024-04-24T07:57:25.468 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: mds.? [v2:172.21.15.179:6826/3576354581,v1:172.21.15.179:6827/3576354581] up:boot
2024-04-24T07:57:25.468 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: fsmap  1 up:standby
2024-04-24T07:57:25.468 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: from='mgr.14152 172.21.15.179:0/1611738610' entity='mgr.x' cmd=[{"prefix": "mds metadata", "who": "a.smithi179.dthvbt"}]: dispatch
2024-04-24T07:57:25.469 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: from='mgr.14152 172.21.15.179:0/1611738610' entity='mgr.x'
2024-04-24T07:57:25.469 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: from='mgr.14152 172.21.15.179:0/1611738610' entity='mgr.x'
2024-04-24T07:57:25.469 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: from='mgr.14152 172.21.15.179:0/1611738610' entity='mgr.x'
2024-04-24T07:57:25.469 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: from='mgr.14152 172.21.15.179:0/1611738610' entity='mgr.x' cmd=[{"prefix": "auth get-or-create", "entity": "mds.a.smithi179.fefbdj", "caps": ["mon", "profile mds", "osd", "allow rw tag cephfs *=*", "mds", "allow"]}]: dispatch
2024-04-24T07:57:25.469 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: from='mgr.14152 172.21.15.179:0/1611738610' entity='mgr.x' cmd='[{"prefix": "auth get-or-create", "entity": "mds.a.smithi179.fefbdj", "caps": ["mon", "profile mds", "osd", "allow rw tag cephfs *=*", "mds", "allow"]}]': finished
2024-04-24T07:57:25.469 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-7d875f26-020f-11ef-bc93-c7b262605968-mgr-x[25924]: 
2024-04-24T08:26:24.468 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 08:26:24 smithi179 systemd[1]: Stopping Ceph mgr.x for 7d875f26-020f-11ef-bc93-c7b262605968...
2024-04-24T08:26:24.912 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 08:26:24 smithi179 ceph-7d875f26-020f-11ef-bc93-c7b262605968-mgr-x[57669]: 2024-04-24T08:26:24.509+0000 7fdfd8400700 -1 received  signal: Terminated from /run/podman-init -- /usr/bin/ceph-mgr -n mgr.x -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false  (PID: 1) UID: 0
2024-04-24T08:26:24.912 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 08:26:24 smithi179 ceph-7d875f26-020f-11ef-bc93-c7b262605968-mgr-x[57669]: 2024-04-24T08:26:24.509+0000 7fdfd8400700 -1 mgr handle_mgr_signal  *** Got signal Terminated ***
2024-04-24T08:26:24.912 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 08:26:24 smithi179 podman[166144]: 2024-04-24 08:26:24.602345947 +0000 UTC m=+0.221365836 container died 05a72e0ed9e3da800075059d99e15165c2e8d74aac0c557b68aa38d371199720 (image=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:1f81fb397ae98da7563d451a78c61574c8f4e6e0, name=ceph-7d875f26-020f-11ef-bc93-c7b262605968-mgr-x, GIT_REPO=git@github.com:ceph/ceph-container.git, maintainer=Guillaume Abrioux <gabrioux@redhat.com>, org.label-schema.license=GPLv2, ceph=True, RELEASE=rishabh-main-apr17-a654945-1f81fb3, org.label-schema.schema-version=1.0, GIT_BRANCH=HEAD, io.buildah.version=1.33.5, org.label-schema.name=CentOS Stream 8 Base Image, org.label-schema.build-date=20240424, GIT_CLEAN=True, GIT_COMMIT=e83daab126629de4b83b70f7819e84da953b5b67, org.label-schema.vendor=CentOS, CEPH_POINT_RELEASE=)
2024-04-24T08:26:25.218 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 08:26:25 smithi179 podman[166144]: 2024-04-24 08:26:25.062096408 +0000 UTC m=+0.681116337 container cleanup 05a72e0ed9e3da800075059d99e15165c2e8d74aac0c557b68aa38d371199720 (image=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:1f81fb397ae98da7563d451a78c61574c8f4e6e0, name=ceph-7d875f26-020f-11ef-bc93-c7b262605968-mgr-x, org.label-schema.schema-version=1.0, CEPH_POINT_RELEASE=, GIT_CLEAN=True, maintainer=Guillaume Abrioux <gabrioux@redhat.com>, GIT_COMMIT=e83daab126629de4b83b70f7819e84da953b5b67, RELEASE=rishabh-main-apr17-a654945-1f81fb3, org.label-schema.name=CentOS Stream 8 Base Image, org.label-schema.build-date=20240424, org.label-schema.license=GPLv2, org.label-schema.vendor=CentOS, GIT_BRANCH=HEAD, io.buildah.version=1.33.5, GIT_REPO=git@github.com:ceph/ceph-container.git, ceph=True)
2024-04-24T08:26:25.218 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 08:26:25 smithi179 bash[166144]: ceph-7d875f26-020f-11ef-bc93-c7b262605968-mgr-x
2024-04-24T08:26:25.718 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 08:26:25 smithi179 podman[166157]: 2024-04-24 08:26:25.305368421 +0000 UTC m=+0.683065612 container remove 05a72e0ed9e3da800075059d99e15165c2e8d74aac0c557b68aa38d371199720 (image=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:1f81fb397ae98da7563d451a78c61574c8f4e6e0, name=ceph-7d875f26-020f-11ef-bc93-c7b262605968-mgr-x, org.label-schema.schema-version=1.0, org.label-schema.license=GPLv2, GIT_BRANCH=HEAD, org.label-schema.vendor=CentOS, org.label-schema.name=CentOS Stream 8 Base Image, RELEASE=rishabh-main-apr17-a654945-1f81fb3, GIT_CLEAN=True, io.buildah.version=1.33.5, maintainer=Guillaume Abrioux <gabrioux@redhat.com>, org.label-schema.build-date=20240424, GIT_REPO=git@github.com:ceph/ceph-container.git, GIT_COMMIT=e83daab126629de4b83b70f7819e84da953b5b67, ceph=True, CEPH_POINT_RELEASE=)
2024-04-24T08:26:25.798 DEBUG:teuthology.orchestra.run.smithi179:> sudo pkill -f 'journalctl -f -n 0 -u ceph-7d875f26-020f-11ef-bc93-c7b262605968@mgr.x.service'
2024-04-24T08:26:25.900 DEBUG:teuthology.orchestra.run:got remote process result: None
2024-04-24T08:26:25.900 INFO:tasks.cephadm.mgr.x:Stopped mgr.x

and this is from my nfs run where no MGR_DOWN was detected:

2024-04-25T08:23:50.790 INFO:tasks.cephadm.mgr.x:Stopping mgr.x...
2024-04-25T08:23:50.790 DEBUG:teuthology.orchestra.run.smithi187:> sudo systemctl stop ceph-5e8a0838-02d9-11ef-bc93-c7b262605968@mgr.x
2024-04-25T08:23:51.141 INFO:journalctl@ceph.mgr.x.smithi187.stdout:Apr 25 08:23:50 smithi187 systemd[1]: Stopping Ceph mgr.x for 5e8a0838-02d9-11ef-bc93-c7b262605968...
2024-04-25T08:23:51.475 INFO:journalctl@ceph.mgr.x.smithi187.stdout:Apr 25 08:23:51 smithi187 ceph-5e8a0838-02d9-11ef-bc93-c7b262605968-mgr-x[57042]: 2024-04-25T08:23:51.141+0000 7f54b1741700 -1 received  signal: Terminated from /run/podman-init -- /usr/bin/ceph-mgr -n mgr.x -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false  (PID: 1) UID: 0
2024-04-25T08:23:51.475 INFO:journalctl@ceph.mgr.x.smithi187.stdout:Apr 25 08:23:51 smithi187 podman[163721]: 2024-04-25 08:23:51.17220845 +0000 UTC m=+0.155629819 container died 5bb9877c1f8812bf06f7076c41d60f35eb3e6353987a9e0d26d056a88311eaf2 (image=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:bfd6349fc33916a8c245f0a9ad7299ef30071853, name=ceph-5e8a0838-02d9-11ef-bc93-c7b262605968-mgr-x, org.label-schema.vendor=CentOS, org.label-schema.license=GPLv2, CEPH_POINT_RELEASE=, org.label-schema.name=CentOS Stream 8 Base Image, org.label-schema.schema-version=1.0, ceph=True, RELEASE=dparmar-24-apr-main-bfd6349, io.buildah.version=1.33.5, GIT_REPO=git@github.com:ceph/ceph-container.git, org.label-schema.build-date=20240424, GIT_BRANCH=HEAD, GIT_COMMIT=e83daab126629de4b83b70f7819e84da953b5b67, maintainer=Guillaume Abrioux <gabrioux@redhat.com>, GIT_CLEAN=True)
2024-04-25T08:23:51.784 INFO:journalctl@ceph.mgr.x.smithi187.stdout:Apr 25 08:23:51 smithi187 podman[163721]: 2024-04-25 08:23:51.558381263 +0000 UTC m=+0.541802624 container cleanup 5bb9877c1f8812bf06f7076c41d60f35eb3e6353987a9e0d26d056a88311eaf2 (image=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:bfd6349fc33916a8c245f0a9ad7299ef30071853, name=ceph-5e8a0838-02d9-11ef-bc93-c7b262605968-mgr-x, ceph=True, maintainer=Guillaume Abrioux <gabrioux@redhat.com>, org.label-schema.build-date=20240424, GIT_CLEAN=True, CEPH_POINT_RELEASE=, GIT_COMMIT=e83daab126629de4b83b70f7819e84da953b5b67, org.label-schema.vendor=CentOS, org.label-schema.license=GPLv2, GIT_BRANCH=HEAD, org.label-schema.name=CentOS Stream 8 Base Image, GIT_REPO=git@github.com:ceph/ceph-container.git, RELEASE=dparmar-24-apr-main-bfd6349, io.buildah.version=1.33.5, org.label-schema.schema-version=1.0)
2024-04-25T08:23:51.784 INFO:journalctl@ceph.mgr.x.smithi187.stdout:Apr 25 08:23:51 smithi187 bash[163721]: ceph-5e8a0838-02d9-11ef-bc93-c7b262605968-mgr-x
2024-04-25T08:23:51.918 DEBUG:teuthology.orchestra.run.smithi187:> sudo pkill -f 'journalctl -f -n 0 -u ceph-5e8a0838-02d9-11ef-bc93-c7b262605968@mgr.x.service'
2024-04-25T08:23:51.952 DEBUG:teuthology.orchestra.run:got remote process result: None
2024-04-25T08:23:51.953 INFO:tasks.cephadm.mgr.x:Stopped mgr.x

The failed run called this:

static void handle_mgr_signal(int signum)
{
  derr << " *** Got signal " << sig_str(signum) << " ***" << dendl;

  // The python modules don't reliably shut down, so don't even
  // try. The mon will blocklist us (and all of our rados/cephfs
  // clients) anyway. Just exit!

  _exit(0);  // exit with 0 result code, as if we had done an orderly shutdown
}

while the one that passed didn't.

I've run eight jobs so far (#56944 (comment)) and none of them experienced this. It looks to me like an intermittent failure: the mgr didn't shut down cleanly, which the comment in `handle_mgr_signal()` also indicates.


dparmar18 commented Apr 25, 2024

`mgr fail` is run by `qa/tasks/mgr/mgr_test_case.py` as part of setting up the mgrs, and this is prevalent only in the nfs suite because it is the only fs suite that makes use of class `MgrTestCase`.

@dparmar18 dparmar18 force-pushed the fix-mgr-down-warning-nfs-suite branch from 3367702 to c6bfd0b Compare April 26, 2024 11:03
@dparmar18 dparmar18 requested review from a team and batrick April 26, 2024 11:04

@batrick batrick left a comment


otherwise lgtm

@@ -0,0 +1,7 @@
# after some recent clog changes, MGR_DOWN warnings are generated before the
# NFS test cases execution begins and this leads to the job getting failed.
Member:

It's sufficient to say MGR_DOWN is required because mgr fail will restart the mgr between tests.

Contributor:

@dparmar18 I've tagged this with my testing branch. Please resolve this comment soon.

Contributor Author:

It's sufficient to say MGR_DOWN is required because mgr fail will restart the mgr between tests.

`mgr fail` happens before the tests are run; I haven't seen it between tests.

Contributor Author:

2024-03-27T06:07:34.833 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:34 smithi184 bash[21504]: audit 2024-03-27T06:07:34.221017+0000 mon.a (mon.0) 324 : audit [INF] from='client.? 172.21.15.184:0/2918680897' entity='client.admin' cmd='[{"prefix": "mgr fail", "who": "x"}]': finished
2024-03-27T06:07:52.025 INFO:tasks.cephfs_test_runner:Starting test: test_cephfs_export_update_at_non_dir_path (tasks.cephfs.test_nfs.TestNFS)

Member:

It's sufficient to say MGR_DOWN is required because mgr fail will restart the mgr between tests.

mgr fail happens before the tests are run, i haven't seen them between tests.

Right, `MgrTestCase.setup_mgrs` is called only when the test class is constructed. Anyway, please fix your comment.

vshankar added a commit to vshankar/ceph that referenced this pull request May 6, 2024
* refs/pull/56944/head:
	qa: add a YAML to ignore MGR_DOWN warning

Reviewed-by: Venky Shankar <vshankar@redhat.com>
@dparmar18 dparmar18 force-pushed the fix-mgr-down-warning-nfs-suite branch from c6bfd0b to 7d954ce Compare May 8, 2024 10:49

vshankar commented May 9, 2024

https://pulpito.ceph.com/vshankar-2024-05-07_03:44:24-fs-wip-vshankar-testing-20240506.153513-testing-default-smithi/

(unfortunately, the failed jobs are infra issues related to cephadm; it would need a rebuild)


vshankar commented May 9, 2024

jenkins retest this please


vshankar commented May 9, 2024

This PR is under test in https://tracker.ceph.com/issues/65882.


@vshankar vshankar left a comment


@vshankar

jenkins test windows


Labels

cephfs Ceph File System
