
qa: add a YAML to ignore MGR_DOWN warning#56944

Merged
vshankar merged 1 commit into ceph:main from dparmar18:fix-mgr-down-warning-nfs-suite
May 16, 2024

Conversation


@dparmar18 dparmar18 commented Apr 17, 2024

RCA showed that it is not the NFS code that led to the warning, since the warning occurred before the test cases started to execute. Later, after some discussion with Venky and Greg, it was found that some recent clog changes lead to this warning being added to the clog.

Digging further, it was found that the warning is generated when mgr fail is run while there is no mgr available. The mgr is unavailable because when setup_mgrs() in class MgrTestCase stops the mgr daemons, sometimes the mgr just crashes - mgr handle_mgr_signal *** Got signal Terminated *** - after which mgr fail (also part of setup_mgrs()) is run and the MGR_DOWN warning is generated.

This warning is only evident in nfs because it is the only fs suite that makes use of class MgrTestCase. To support my analysis, I ran about eight jobs in teuthology and could not reproduce this warning. Since this is not harming the NFS test case execution, and the logs do mention that the mgr daemon did get restarted (INFO:tasks.cephadm.mgr.x:Restarting mgr.x (starting--it wasn't running)...), it is reasonable to conclude that ignoring this warning is the simplest solution.
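To make the sequence concrete, here is a minimal sketch of the failure mode (plain Python, not the actual teuthology code; FakeCluster and its methods are invented for illustration):

```python
# Illustrative simulation: setup_mgrs() stops every mgr daemon, then runs
# `mgr fail`; with no active mgr and no standby left, the mon raises MGR_DOWN.

class FakeCluster:
    def __init__(self, mgrs):
        self.active = mgrs[0]           # first mgr becomes active
        self.standbys = list(mgrs[1:])  # the rest are standbys
        self.health_warnings = []

    def stop_all_mgrs(self):
        # setup_mgrs() stops every mgr daemon before restarting them,
        # so at this point there is neither an active mgr nor a standby.
        self.active = None
        self.standbys = []

    def mgr_fail(self):
        # `ceph mgr fail` marks the active mgr failed; a standby takes
        # over if one exists, otherwise the mon raises MGR_DOWN.
        self.active = None
        if self.standbys:
            self.active = self.standbys.pop(0)
        else:
            self.health_warnings.append("MGR_DOWN")

cluster = FakeCluster(["x"])
cluster.stop_all_mgrs()
cluster.mgr_fail()
print(cluster.health_warnings)  # ['MGR_DOWN']

# Adding a second mgr does not help, because setup_mgrs() stops them all
# before running `mgr fail`.
cluster2 = FakeCluster(["x", "y"])
cluster2.stop_all_mgrs()
cluster2.mgr_fail()
print(cluster2.health_warnings)  # ['MGR_DOWN']
```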

Fixes: https://tracker.ceph.com/issues/65265
Signed-off-by: Dhairya Parmar dparmar@redhat.com

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@github-actions github-actions bot added the cephfs Ceph File System label Apr 17, 2024
@dparmar18 dparmar18 force-pushed the fix-mgr-down-warning-nfs-suite branch from 6268392 to b3a32d9 Compare April 17, 2024 09:29
@dparmar18 dparmar18 requested a review from a team April 17, 2024 12:27

dparmar18 commented Apr 18, 2024

I still see the MGR_DOWN warning even after adding one more MGR:
https://pulpito.ceph.com/dparmar-2024-04-18_09:23:37-fs:nfs-wip-65265-distro-default-smithi/

@dparmar18

2024-04-18T09:50:07.439 INFO:journalctl@ceph.mon.a.smithi121.stdout:Apr 18 09:50:07 smithi121 bash[21537]: cluster 2024-04-18T09:50:06.855313+0000 mon.a (mon.0) 358 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)

still evident even after adding one more MGR

https://pulpito.ceph.com/dparmar-2024-04-18_09:23:37-fs:nfs-wip-65265-distro-default-smithi/

@dparmar18

I guess adding this warning to the ignorelist is the only option.

@dparmar18

So `ignorelist_health.yaml` -> `.qa/cephfs/overrides/ignorelist_health.yaml` is a cephfs suite-wide YAML; I hope it is fine to add the warning there @vshankar @gregsfortytwo
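For reference, such an override follows the usual teuthology shape, roughly like the sketch below (illustrative only; the exact path and entry wording are whatever the commit in this PR adds, not necessarily this):

```yaml
# illustrative sketch of a teuthology override that tells the log scraper
# to ignore the MGR_DOWN health warning for suites that include this YAML
overrides:
  ceph:
    log-ignorelist:
      - MGR_DOWN
```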

@dparmar18 dparmar18 force-pushed the fix-mgr-down-warning-nfs-suite branch from b3a32d9 to 16695f1 Compare April 18, 2024 14:38
@dparmar18 dparmar18 changed the title qa: fix "no active mgr (MGR_DOWN)" in fs:nfs suite qa: add a YAML to ignore MGR_DOWN warning Apr 18, 2024
@dparmar18 dparmar18 force-pushed the fix-mgr-down-warning-nfs-suite branch from 16695f1 to 3367702 Compare April 18, 2024 14:50

@batrick batrick left a comment


RCA showed that it is not the NFS code that lead to the warning since
the warning occurred before the test cases started to execute, later on
after some discussion with the venky and greg, it was found that there
were some clog changes made recently which leads to this warning being
added to the clog.

Yes, the clog checks are failing the job. But why is the mgr getting restarted (and how) to cause MGR_DOWN? That is the root cause that needs to be identified.

@dparmar18

Okay, let's take two runs as examples:

  1. https://pulpito.ceph.com/rishabh-2024-03-27_05:27:11-fs-wip-rishabh-testing-20240326.131558-testing-default-smithi/7625569/
  2. https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

Both run with exact same YAML matrix except the distro:

  1. rishabh's run, where the MGR_DOWN warning was seen:
fs/nfs/{cluster/{1-node} conf/{client mds mgr mon osd} overrides/{ignorelist_health pg_health} supported-random-distros$/{ubuntu_latest} tasks/nfs}
  2. mine, where the run failed, but the failure is irrelevant since no MGR_DOWN warning was seen:
fs:nfs/{cluster/{1-node} conf/{client mds mgr mon osd} overrides/{ignorelist_health pg_health} supported-random-distros$/{centos_latest} tasks/nfs}

In 1), there is a command to stop the mgr:

2024-03-27T06:07:29.758 INFO:tasks.cephadm.mgr.x:Stopping mgr.x...
2024-03-27T06:07:29.758 DEBUG:teuthology.orchestra.run.smithi184:> sudo systemctl stop ceph-2d1fee3e-ebff-11ee-95d0-87774f69a715@mgr.x
2024-03-27T06:07:30.028 INFO:journalctl@ceph.osd.0.smithi184.stdout:Mar 27 06:07:29 smithi184 systemd[1]: /etc/systemd/system/ceph-2d1fee3e-ebff-11ee-95d0-87774f69a715@.service:23: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.

then it somehow crashed:

2024-03-27T06:07:30.143 INFO:journalctl@ceph.mgr.x.smithi184.stdout:Mar 27 06:07:30 smithi184 systemd[1]: Stopping Ceph mgr.x for 2d1fee3e-ebff-11ee-95d0-87774f69a715...
2024-03-27T06:07:30.777 INFO:journalctl@ceph.mgr.x.smithi184.stdout:Mar 27 06:07:30 smithi184 bash[21756]: debug 2024-03-27T06:07:30.729+0000 7f823c400700 -1 received  signal: Terminated from /sbin/docker-init -- /usr/bin/ceph-mgr -n mgr.x -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug   (PID: 1) UID: 0
2024-03-27T06:07:30.778 INFO:journalctl@ceph.mgr.x.smithi184.stdout:Mar 27 06:07:30 smithi184 bash[21756]: debug 2024-03-27T06:07:30.729+0000 7f823c400700 -1 mgr handle_mgr_signal  *** Got signal Terminated ***

and when `mgr fail` was run, it generated the warning:

2024-03-27T06:07:34.833 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:34 smithi184 bash[21504]: audit 2024-03-27T06:07:34.218831+0000 mon.a (mon.0) 322 : audit [INF] from='client.? 172.21.15.184:0/2918680897' entity='client.admin' cmd=[{"prefix": "mgr fail", "who": "x"}]: dispatch
2024-03-27T06:07:34.833 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:34 smithi184 bash[21504]: cluster 2024-03-27T06:07:34.219228+0000 mon.a (mon.0) 323 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
2024-03-27T06:07:34.833 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:34 smithi184 bash[21504]: audit 2024-03-27T06:07:34.221017+0000 mon.a (mon.0) 324 : audit [INF] from='client.? 172.21.15.184:0/2918680897' entity='client.admin' cmd='[{"prefix": "mgr fail", "who": "x"}]': finished

i.e. the MGR_DOWN warning is generated when the `mgr fail` command is run while there is no active mgr.


Now if we look at 2):

2024-04-10T06:59:57.131 INFO:tasks.cephadm.mgr.x:Stopping mgr.x...
2024-04-10T06:59:57.131 DEBUG:teuthology.orchestra.run.smithi031:> sudo systemctl stop ceph-57e90e8a-f707-11ee-b64a-cb9ed24678a4@mgr.x
2024-04-10T06:59:57.415 INFO:journalctl@ceph.mgr.x.smithi031.stdout:Apr 10 06:59:57 smithi031 systemd[1]: Stopping Ceph mgr.x for 57e90e8a-f707-11ee-b64a-cb9ed24678a4...
2024-04-10T06:59:58.443 INFO:journalctl@ceph.mgr.x.smithi031.stdout:Apr 10 06:59:58 smithi031 systemd[1]: ceph-57e90e8a-f707-11ee-b64a-cb9ed24678a4@mgr.x.service: Deactivated successfully.
2024-04-10T06:59:58.443 INFO:journalctl@ceph.mgr.x.smithi031.stdout:Apr 10 06:59:58 smithi031 systemd[1]: Stopped Ceph mgr.x for 57e90e8a-f707-11ee-b64a-cb9ed24678a4.

(I don't see the above lines in (1)'s teuthology logs.)

2024-04-10T06:59:58.443 INFO:journalctl@ceph.mgr.x.smithi031.stdout:Apr 10 06:59:58 smithi031 systemd[1]: ceph-57e90e8a-f707-11ee-b64a-cb9ed24678a4@mgr.x.service: Consumed 33.490s CPU time.
2024-04-10T06:59:58.444 DEBUG:teuthology.orchestra.run.smithi031:> sudo pkill -f 'journalctl -f -n 0 -u ceph-57e90e8a-f707-11ee-b64a-cb9ed24678a4@mgr.x.service'

and the `mgr fail` succeeds:

2024-04-10T06:59:59.962 INFO:teuthology.orchestra.run.smithi031.stderr:2024-04-10T06:59:59.965+0000 7f7a11ddc640  1 -- 172.21.15.31:0/1174769028 --> [v2:172.21.15.31:3300/0,v1:172.21.15.31:6789/0] -- mon_command({"prefix": "mgr fail", "who": "x"} v 0) -- 0x7f7a0c02d5e0 con 0x7f7a0c102590
2024-04-10T06:59:59.964 INFO:teuthology.orchestra.run.smithi031.stderr:2024-04-10T06:59:59.967+0000 7f7a097fa640  1 -- 172.21.15.31:0/1174769028 <== mon.0 v2:172.21.15.31:3300/0 7 ==== mon_command_ack([{"prefix": "mgr fail", "who": "x"}]=0  v16) ==== 68+0+0 (secure 0 0 0) 0x7f79fc03cc10 con 0x7f7a0c102590

@dparmar18

#56944 (comment) can be reproduced easily:

  • I first start a cluster:
dparmar:build$ ./bin/ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2024-04-24T17:12:50.193+0530 7fb290bd06c0 -1 WARNING: all dangerous and experimental features are enabled.
2024-04-24T17:12:50.198+0530 7fb290bd06c0 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     8cadbeca-4f35-4ca9-b998-e72350d09fa4
    health: HEALTH_WARN
            Module 'rbd_support' has failed dependency: module 'rbd' has no attribute 'Image'
 
  services:
    mon: 3 daemons, quorum a,b,c (age 52s)
    mgr: x(active, since 47s)
    mds: 1/1 daemons up, 2 standby
    osd: 3 osds: 3 up (since 18s), 3 in (since 33s)
 
  data:
    volumes: 1/1 healthy
    pools:   3 pools, 3 pgs
    objects: 24 objects, 451 KiB
    usage:   3.0 GiB used, 300 GiB / 303 GiB avail
    pgs:     3 active+clean
 
  io:
    client:   1.7 KiB/s wr, 0 op/s rd, 5 op/s wr
  • verify no warnings exist
    dparmar:build$ grep -rnI 'MGR_DOWN' ./out/

  • kill mgr

dparmar:build$ pgrep ceph-mgr 
738866
dparmar:build$ kill -9 738866
  • wait a bit and check status again
dparmar:build$ ./bin/ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2024-04-24T17:16:44.528+0530 7f07a774f6c0 -1 WARNING: all dangerous and experimental features are enabled.
2024-04-24T17:16:44.533+0530 7f07a774f6c0 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     8cadbeca-4f35-4ca9-b998-e72350d09fa4
    health: HEALTH_WARN
            no active mgr
            Module 'rbd_support' has failed dependency: module 'rbd' has no attribute 'Image'
 
  services:
    mon: 3 daemons, quorum a,b,c (age 4m)
    mgr: no daemons active (since 107s)
    mds: 1/1 daemons up, 2 standby
    osd: 3 osds: 3 up (since 4m), 3 in (since 4m)
 
  data:
    volumes: 1/1 healthy
    pools:   3 pools, 145 pgs
    objects: 24 objects, 451 KiB
    usage:   3.0 GiB used, 300 GiB / 303 GiB avail
    pgs:     145 active+clean
  • run mgr fail
dparmar:build$ ./bin/ceph mgr fail
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2024-04-24T17:14:56.626+0530 7f08a1d7c6c0 -1 WARNING: all dangerous and experimental features are enabled.
2024-04-24T17:14:56.632+0530 7f08a1d7c6c0 -1 WARNING: all dangerous and experimental features are enabled.
  • grep for MGR_DOWN
dparmar:build$ grep -rnI 'MGR_DOWN' ./out/
./out/mon.a.log:98750:2024-04-24T17:14:56.728+0530 7f08a05f46c0  0 log_channel(cluster) log [WRN] : Health check failed: no active mgr (MGR_DOWN)
./out/mon.a.log:98788:2024-04-24T17:14:56.745+0530 7f08a05f46c0 10 mon.a@0(leader).log v128  logging 2024-04-24T17:14:56.729209+0530 mon.a (mon.0) 240 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
./out/mon.a.log:99057:2024-04-24T17:14:56.805+0530 7f089ddef6c0  7 mon.a@0(leader).log v129 update_from_paxos applying incremental log 129 2024-04-24T17:14:56.729209+0530 mon.a (mon.0) 240 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
./out/mon.b.log:38194:2024-04-24T17:14:56.814+0530 7fc3bf9f46c0  7 mon.b@1(peon).log v129 update_from_paxos applying incremental log 129 2024-04-24T17:14:56.729209+0530 mon.a (mon.0) 240 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
./out/mon.c.log:45422:2024-04-24T17:14:56.813+0530 7f74883f46c0  7 mon.c@2(peon).log v129 update_from_paxos applying incremental log 129 2024-04-24T17:14:56.729209+0530 mon.a (mon.0) 240 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
./out/cluster.mon.a.log:662:2024-04-24T17:14:56.729209+0530 mon.a (mon.0) 240 : cluster 3 Health check failed: no active mgr (MGR_DOWN)
./out/cluster.mon.b.log:662:2024-04-24T17:14:56.729209+0530 mon.a (mon.0) 240 : cluster 3 Health check failed: no active mgr (MGR_DOWN)
./out/cluster.mon.c.log:662:2024-04-24T17:14:56.729209+0530 mon.a (mon.0) 240 : cluster 3 Health check failed: no active mgr (MGR_DOWN)

@dparmar18

So the log below is from the run with the MGR_DOWN warning:

2024-04-24T07:57:24.232 INFO:tasks.cephadm.mgr.x:Stopping mgr.x...
2024-04-24T07:57:24.232 DEBUG:teuthology.orchestra.run.smithi179:> sudo systemctl stop ceph-7d875f26-020f-11ef-bc93-c7b262605968@mgr.x
2024-04-24T07:57:24.467 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:24 smithi179 ceph-mon[25686]: from='client.? 172.21.15.179:0/2524918627' entity='client.admin' cmd=[{"prefix": "osd pool get", "pool": ".mgr", "var": "pg_num"}]: dispatch
2024-04-24T07:57:24.468 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 07:57:24 smithi179 systemd[1]: Stopping Ceph mgr.x for 7d875f26-020f-11ef-bc93-c7b262605968...
2024-04-24T07:57:25.468 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: pgmap v117: 1 pgs: 1 active+clean; 577 KiB data, 82 MiB used, 268 GiB / 268 GiB avail
2024-04-24T07:57:25.468 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: from='client.? 172.21.15.179:0/684086560' entity='client.admin'
2024-04-24T07:57:25.468 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: mds.? [v2:172.21.15.179:6826/3576354581,v1:172.21.15.179:6827/3576354581] up:boot
2024-04-24T07:57:25.468 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: fsmap  1 up:standby
2024-04-24T07:57:25.468 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: from='mgr.14152 172.21.15.179:0/1611738610' entity='mgr.x' cmd=[{"prefix": "mds metadata", "who": "a.smithi179.dthvbt"}]: dispatch
2024-04-24T07:57:25.469 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: from='mgr.14152 172.21.15.179:0/1611738610' entity='mgr.x'
2024-04-24T07:57:25.469 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: from='mgr.14152 172.21.15.179:0/1611738610' entity='mgr.x'
2024-04-24T07:57:25.469 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: from='mgr.14152 172.21.15.179:0/1611738610' entity='mgr.x'
2024-04-24T07:57:25.469 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: from='mgr.14152 172.21.15.179:0/1611738610' entity='mgr.x' cmd=[{"prefix": "auth get-or-create", "entity": "mds.a.smithi179.fefbdj", "caps": ["mon", "profile mds", "osd", "allow rw tag cephfs *=*", "mds", "allow"]}]: dispatch
2024-04-24T07:57:25.469 INFO:journalctl@ceph.mon.a.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-mon[25686]: from='mgr.14152 172.21.15.179:0/1611738610' entity='mgr.x' cmd='[{"prefix": "auth get-or-create", "entity": "mds.a.smithi179.fefbdj", "caps": ["mon", "profile mds", "osd", "allow rw tag cephfs *=*", "mds", "allow"]}]': finished
2024-04-24T07:57:25.469 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 07:57:25 smithi179 ceph-7d875f26-020f-11ef-bc93-c7b262605968-mgr-x[25924]: 
2024-04-24T08:26:24.468 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 08:26:24 smithi179 systemd[1]: Stopping Ceph mgr.x for 7d875f26-020f-11ef-bc93-c7b262605968...
2024-04-24T08:26:24.912 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 08:26:24 smithi179 ceph-7d875f26-020f-11ef-bc93-c7b262605968-mgr-x[57669]: 2024-04-24T08:26:24.509+0000 7fdfd8400700 -1 received  signal: Terminated from /run/podman-init -- /usr/bin/ceph-mgr -n mgr.x -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false  (PID: 1) UID: 0
2024-04-24T08:26:24.912 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 08:26:24 smithi179 ceph-7d875f26-020f-11ef-bc93-c7b262605968-mgr-x[57669]: 2024-04-24T08:26:24.509+0000 7fdfd8400700 -1 mgr handle_mgr_signal  *** Got signal Terminated ***
2024-04-24T08:26:24.912 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 08:26:24 smithi179 podman[166144]: 2024-04-24 08:26:24.602345947 +0000 UTC m=+0.221365836 container died 05a72e0ed9e3da800075059d99e15165c2e8d74aac0c557b68aa38d371199720 (image=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:1f81fb397ae98da7563d451a78c61574c8f4e6e0, name=ceph-7d875f26-020f-11ef-bc93-c7b262605968-mgr-x, GIT_REPO=git@github.com:ceph/ceph-container.git, maintainer=Guillaume Abrioux <gabrioux@redhat.com>, org.label-schema.license=GPLv2, ceph=True, RELEASE=rishabh-main-apr17-a654945-1f81fb3, org.label-schema.schema-version=1.0, GIT_BRANCH=HEAD, io.buildah.version=1.33.5, org.label-schema.name=CentOS Stream 8 Base Image, org.label-schema.build-date=20240424, GIT_CLEAN=True, GIT_COMMIT=e83daab126629de4b83b70f7819e84da953b5b67, org.label-schema.vendor=CentOS, CEPH_POINT_RELEASE=)
2024-04-24T08:26:25.218 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 08:26:25 smithi179 podman[166144]: 2024-04-24 08:26:25.062096408 +0000 UTC m=+0.681116337 container cleanup 05a72e0ed9e3da800075059d99e15165c2e8d74aac0c557b68aa38d371199720 (image=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:1f81fb397ae98da7563d451a78c61574c8f4e6e0, name=ceph-7d875f26-020f-11ef-bc93-c7b262605968-mgr-x, org.label-schema.schema-version=1.0, CEPH_POINT_RELEASE=, GIT_CLEAN=True, maintainer=Guillaume Abrioux <gabrioux@redhat.com>, GIT_COMMIT=e83daab126629de4b83b70f7819e84da953b5b67, RELEASE=rishabh-main-apr17-a654945-1f81fb3, org.label-schema.name=CentOS Stream 8 Base Image, org.label-schema.build-date=20240424, org.label-schema.license=GPLv2, org.label-schema.vendor=CentOS, GIT_BRANCH=HEAD, io.buildah.version=1.33.5, GIT_REPO=git@github.com:ceph/ceph-container.git, ceph=True)
2024-04-24T08:26:25.218 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 08:26:25 smithi179 bash[166144]: ceph-7d875f26-020f-11ef-bc93-c7b262605968-mgr-x
2024-04-24T08:26:25.718 INFO:journalctl@ceph.mgr.x.smithi179.stdout:Apr 24 08:26:25 smithi179 podman[166157]: 2024-04-24 08:26:25.305368421 +0000 UTC m=+0.683065612 container remove 05a72e0ed9e3da800075059d99e15165c2e8d74aac0c557b68aa38d371199720 (image=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:1f81fb397ae98da7563d451a78c61574c8f4e6e0, name=ceph-7d875f26-020f-11ef-bc93-c7b262605968-mgr-x, org.label-schema.schema-version=1.0, org.label-schema.license=GPLv2, GIT_BRANCH=HEAD, org.label-schema.vendor=CentOS, org.label-schema.name=CentOS Stream 8 Base Image, RELEASE=rishabh-main-apr17-a654945-1f81fb3, GIT_CLEAN=True, io.buildah.version=1.33.5, maintainer=Guillaume Abrioux <gabrioux@redhat.com>, org.label-schema.build-date=20240424, GIT_REPO=git@github.com:ceph/ceph-container.git, GIT_COMMIT=e83daab126629de4b83b70f7819e84da953b5b67, ceph=True, CEPH_POINT_RELEASE=)
2024-04-24T08:26:25.798 DEBUG:teuthology.orchestra.run.smithi179:> sudo pkill -f 'journalctl -f -n 0 -u ceph-7d875f26-020f-11ef-bc93-c7b262605968@mgr.x.service'
2024-04-24T08:26:25.900 DEBUG:teuthology.orchestra.run:got remote process result: None
2024-04-24T08:26:25.900 INFO:tasks.cephadm.mgr.x:Stopped mgr.x

and this is from my nfs run where no MGR_DOWN was detected:

2024-04-25T08:23:50.790 INFO:tasks.cephadm.mgr.x:Stopping mgr.x...
2024-04-25T08:23:50.790 DEBUG:teuthology.orchestra.run.smithi187:> sudo systemctl stop ceph-5e8a0838-02d9-11ef-bc93-c7b262605968@mgr.x
2024-04-25T08:23:51.141 INFO:journalctl@ceph.mgr.x.smithi187.stdout:Apr 25 08:23:50 smithi187 systemd[1]: Stopping Ceph mgr.x for 5e8a0838-02d9-11ef-bc93-c7b262605968...
2024-04-25T08:23:51.475 INFO:journalctl@ceph.mgr.x.smithi187.stdout:Apr 25 08:23:51 smithi187 ceph-5e8a0838-02d9-11ef-bc93-c7b262605968-mgr-x[57042]: 2024-04-25T08:23:51.141+0000 7f54b1741700 -1 received  signal: Terminated from /run/podman-init -- /usr/bin/ceph-mgr -n mgr.x -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false  (PID: 1) UID: 0
2024-04-25T08:23:51.475 INFO:journalctl@ceph.mgr.x.smithi187.stdout:Apr 25 08:23:51 smithi187 podman[163721]: 2024-04-25 08:23:51.17220845 +0000 UTC m=+0.155629819 container died 5bb9877c1f8812bf06f7076c41d60f35eb3e6353987a9e0d26d056a88311eaf2 (image=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:bfd6349fc33916a8c245f0a9ad7299ef30071853, name=ceph-5e8a0838-02d9-11ef-bc93-c7b262605968-mgr-x, org.label-schema.vendor=CentOS, org.label-schema.license=GPLv2, CEPH_POINT_RELEASE=, org.label-schema.name=CentOS Stream 8 Base Image, org.label-schema.schema-version=1.0, ceph=True, RELEASE=dparmar-24-apr-main-bfd6349, io.buildah.version=1.33.5, GIT_REPO=git@github.com:ceph/ceph-container.git, org.label-schema.build-date=20240424, GIT_BRANCH=HEAD, GIT_COMMIT=e83daab126629de4b83b70f7819e84da953b5b67, maintainer=Guillaume Abrioux <gabrioux@redhat.com>, GIT_CLEAN=True)
2024-04-25T08:23:51.784 INFO:journalctl@ceph.mgr.x.smithi187.stdout:Apr 25 08:23:51 smithi187 podman[163721]: 2024-04-25 08:23:51.558381263 +0000 UTC m=+0.541802624 container cleanup 5bb9877c1f8812bf06f7076c41d60f35eb3e6353987a9e0d26d056a88311eaf2 (image=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:bfd6349fc33916a8c245f0a9ad7299ef30071853, name=ceph-5e8a0838-02d9-11ef-bc93-c7b262605968-mgr-x, ceph=True, maintainer=Guillaume Abrioux <gabrioux@redhat.com>, org.label-schema.build-date=20240424, GIT_CLEAN=True, CEPH_POINT_RELEASE=, GIT_COMMIT=e83daab126629de4b83b70f7819e84da953b5b67, org.label-schema.vendor=CentOS, org.label-schema.license=GPLv2, GIT_BRANCH=HEAD, org.label-schema.name=CentOS Stream 8 Base Image, GIT_REPO=git@github.com:ceph/ceph-container.git, RELEASE=dparmar-24-apr-main-bfd6349, io.buildah.version=1.33.5, org.label-schema.schema-version=1.0)
2024-04-25T08:23:51.784 INFO:journalctl@ceph.mgr.x.smithi187.stdout:Apr 25 08:23:51 smithi187 bash[163721]: ceph-5e8a0838-02d9-11ef-bc93-c7b262605968-mgr-x
2024-04-25T08:23:51.918 DEBUG:teuthology.orchestra.run.smithi187:> sudo pkill -f 'journalctl -f -n 0 -u ceph-5e8a0838-02d9-11ef-bc93-c7b262605968@mgr.x.service'
2024-04-25T08:23:51.952 DEBUG:teuthology.orchestra.run:got remote process result: None
2024-04-25T08:23:51.953 INFO:tasks.cephadm.mgr.x:Stopped mgr.x

The failed run called this:

static void handle_mgr_signal(int signum)
{
  derr << " *** Got signal " << sig_str(signum) << " ***" << dendl;

  // The python modules don't reliably shut down, so don't even
  // try. The mon will blocklist us (and all of our rados/cephfs
  // clients) anyway. Just exit!

  _exit(0);  // exit with 0 result code, as if we had done an orderly shutdown
}

while the one that passed didn't.

I've run eight jobs so far (#56944 (comment)) and none of them experienced this. It looks to me like an intermittent failure: the mgr didn't shut down cleanly, which the comment in `handle_mgr_signal()` also indicates.


dparmar18 commented Apr 25, 2024

`mgr fail` is run by `qa/tasks/mgr/mgr_test_case.py` as part of setting up the mgrs, and this is prevalent only in the nfs suite because it is the only fs suite that makes use of class `MgrTestCase`.

@dparmar18 dparmar18 force-pushed the fix-mgr-down-warning-nfs-suite branch from 3367702 to c6bfd0b Compare April 26, 2024 11:03
@dparmar18 dparmar18 requested review from a team and batrick April 26, 2024 11:04

@batrick batrick left a comment


otherwise lgtm

@@ -0,0 +1,7 @@
# after some recent clog changes, MGR_DOWN warnings are generated before the
# NFS test cases execution begins and this leads to the job getting failed.
Member:

It's sufficient to say MGR_DOWN is required because mgr fail will restart the mgr between tests.

Contributor:

@dparmar18 I've tagged this with my testing branch. Please resolve this comment soon.

Contributor Author:

It's sufficient to say MGR_DOWN is required because mgr fail will restart the mgr between tests.

`mgr fail` happens before the tests are run; I haven't seen it between tests.

Contributor Author:

2024-03-27T06:07:34.833 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:34 smithi184 bash[21504]: audit 2024-03-27T06:07:34.221017+0000 mon.a (mon.0) 324 : audit [INF] from='client.? 172.21.15.184:0/2918680897' entity='client.admin' cmd='[{"prefix": "mgr fail", "who": "x"}]': finished
2024-03-27T06:07:52.025 INFO:tasks.cephfs_test_runner:Starting test: test_cephfs_export_update_at_non_dir_path (tasks.cephfs.test_nfs.TestNFS)

Member:

It's sufficient to say MGR_DOWN is required because mgr fail will restart the mgr between tests.

mgr fail happens before the tests are run, i haven't seen them between tests.

Right, `MgrTestCase.setup_mgrs` is called only when the test class is constructed. Anyway, please fix your comment.

vshankar added a commit to vshankar/ceph that referenced this pull request May 6, 2024
* refs/pull/56944/head:
	qa: add a YAML to ignore MGR_DOWN warning

Reviewed-by: Venky Shankar <vshankar@redhat.com>
@dparmar18 dparmar18 force-pushed the fix-mgr-down-warning-nfs-suite branch from c6bfd0b to 7d954ce Compare May 8, 2024 10:49

vshankar commented May 9, 2024

https://pulpito.ceph.com/vshankar-2024-05-07_03:44:24-fs-wip-vshankar-testing-20240506.153513-testing-default-smithi/

(unfortunately, the failed jobs are infra issues related to cephadm; it would need a rebuild)


vshankar commented May 9, 2024

jenkins retest this please


vshankar commented May 9, 2024

This PR is under test in https://tracker.ceph.com/issues/65882.


@vshankar vshankar left a comment


@vshankar

jenkins test windows


Labels

cephfs Ceph File System
