qa: add a YAML to ignore MGR_DOWN warning #56944
6268392 to
b3a32d9
Compare
|
i still see the |
|
still evident even after adding one more MGR https://pulpito.ceph.com/dparmar-2024-04-18_09:23:37-fs:nfs-wip-65265-distro-default-smithi/ |
|
I guess adding this warning to allowlist is the only option |
|
so |
b3a32d9 to
16695f1
Compare
16695f1 to
3367702
Compare
batrick
left a comment
RCA showed that it is not the NFS code that led to the warning, since
the warning occurred before the test cases started to execute. Later on,
after some discussion with Venky and Greg, it was found that there
were some clog changes made recently which lead to this warning being
added to the clog.
Yes, the clog checks are failing the job. But why is the mgr getting restarted (and how) to cause MGR_DOWN? That is the root cause that needs to be identified.
|
Okay we will take example of two runs:
Both run with exact same YAML matrix except the distro:
In 1), then somehow it crashed and when i.e. the warning Now if we look at 2) (i don't see above lines in (1)'s teuth logs) and the |
|
#56944 (comment) can be reproduced easily:
|
|
I ran a couple of NFS jobs, no MGR_DOWN reported https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/ MGR_DOWN is reported when the |
|
The log below is from the run with the MGR_DOWN warning, and this is from my NFS run where no MGR_DOWN was detected. The failed run called this while the one that passed didn't. I've run 8 jobs so far (#56944 (comment)); none of them experienced this. This looks to me like an intermittent failure. the |
|
|
3367702 to
c6bfd0b
Compare
@@ -0,0 +1,7 @@
# after some recent clog changes, MGR_DOWN warnings are generated before the
# NFS test cases execution begins and this leads to the job getting failed.
It's sufficient to say MGR_DOWN is required because mgr fail will restart the mgr between tests.
@dparmar18 I've tagged this with my testing branch. Please resolve this comment soon.
It's sufficient to say MGR_DOWN is required because `mgr fail` will restart the mgr between tests.
`mgr fail` happens before the tests are run; I haven't seen it between tests.
2024-03-27T06:07:34.833 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:34 smithi184 bash[21504]: audit 2024-03-27T06:07:34.221017+0000 mon.a (mon.0) 324 : audit [INF] from='client.? 172.21.15.184:0/2918680897' entity='client.admin' cmd='[{"prefix": "mgr fail", "who": "x"}]': finished
2024-03-27T06:07:52.025 INFO:tasks.cephfs_test_runner:Starting test: test_cephfs_export_update_at_non_dir_path (tasks.cephfs.test_nfs.TestNFS)
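The ordering claimed here can be checked mechanically. The following is a minimal sketch, assuming teuthology log lines start with an ISO-style timestamp as in the two lines quoted above; the helper names `stamp` and `first_match` are hypothetical, and the first log line is shortened with `...` for readability.

```python
from datetime import datetime

# The two teuthology log lines quoted above; the first is shortened ("...").
LOG_LINES = [
    "2024-03-27T06:07:34.833 INFO:journalctl@ceph.mon.a.smithi184.stdout:... "
    "cmd='[{\"prefix\": \"mgr fail\", \"who\": \"x\"}]': finished",
    "2024-03-27T06:07:52.025 INFO:tasks.cephfs_test_runner:Starting test: "
    "test_cephfs_export_update_at_non_dir_path (tasks.cephfs.test_nfs.TestNFS)",
]


def stamp(line):
    """Parse the leading timestamp of a log line (hypothetical helper)."""
    return datetime.strptime(line.split(" ", 1)[0], "%Y-%m-%dT%H:%M:%S.%f")


def first_match(lines, needle):
    """Return the timestamp of the first line containing needle, or None."""
    for line in lines:
        if needle in line:
            return stamp(line)
    return None


mgr_fail_at = first_match(LOG_LINES, '"prefix": "mgr fail"')
first_test_at = first_match(LOG_LINES, "Starting test:")
gap = (first_test_at - mgr_fail_at).total_seconds()
print(f"mgr fail precedes the first test by {gap:.3f}s")
```

Run against a full teuthology log instead of these two lines, the same comparison would show whether any `mgr fail` audit entry lands after the first `Starting test:` line.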
It's sufficient to say MGR_DOWN is required because `mgr fail` will restart the mgr between tests.
`mgr fail` happens before the tests are run, i haven't seen them between tests.
Right, `MgrTestCase.setup_mgrs` is called only when the test class is constructed. Anyway, please fix your comment.
* refs/pull/56944/head:
	qa: add a YAML to ignore MGR_DOWN warning

Reviewed-by: Venky Shankar <vshankar@redhat.com>
RCA showed that it is not the NFS code that led to the warning, since the warning occurred before the test cases started to execute. Later on, after some discussion with Venky and Greg, it was found that some recent clog changes lead to this warning being added to the clog.

Digging further, it was found that the warning is generated when `mgr fail` is run while no mgr is available. The mgr is unavailable because, when `setup_mgrs()` in class `MgrTestCase` stops the mgr daemons, the mgr sometimes just crashes (`mgr handle_mgr_signal *** Got signal Terminated ***`). After that, `mgr fail` (again part of `setup_mgrs()`) is run and the `MGR_DOWN` warning is generated.

This warning is only evident in nfs because it is the only fs suite that makes use of class `MgrTestCase`. To support my analysis, I ran about eight jobs in teuthology and could not reproduce this warning. Since this does not harm the execution of the NFS test cases and the logs do mention that the mgr daemon got restarted (`INFO:tasks.cephadm.mgr.x:Restarting mgr.x (starting--it wasn't running)...`), it is reasonable to conclude that ignoring this warning is the simplest solution.

Fixes: https://tracker.ceph.com/issues/65265
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
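As a rough illustration of what ignoring the warning means for the clog check, the sketch below filters warning lines against a list of regular expressions. It is not teuthology's implementation: the function `unexpected_warnings`, the `IGNORELIST` entry, and the sample log lines are all assumptions made for this example.

```python
import re

# Hypothetical ignorelist entry: the escaped parentheses match the literal
# "(MGR_DOWN)" health-check code at the end of a cluster log line.
IGNORELIST = [r"\(MGR_DOWN\)"]

# Illustrative cluster log lines, not copied from a real run.
CLOG = [
    "cluster [WRN] Health check failed: no active mgr (MGR_DOWN)",
    "cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)",
    "cluster [INF] Health check cleared: MGR_DOWN (was: no active mgr)",
]


def unexpected_warnings(lines, ignorelist):
    """Return warning lines that match none of the ignorelist patterns."""
    patterns = [re.compile(p) for p in ignorelist]
    return [
        line
        for line in lines
        if "[WRN]" in line and not any(p.search(line) for p in patterns)
    ]


remaining = unexpected_warnings(CLOG, IGNORELIST)
print(remaining)
```

With the `MGR_DOWN` pattern in place, only the unrelated `OSD_DOWN` warning would still count against the job in this sketch.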
c6bfd0b to
7d954ce
Compare
|
(unfortunately, the failures are infra issues related to cephadm; this would need a rebuild) |
|
jenkins retest this please |
|
This PR is under test in https://tracker.ceph.com/issues/65882. |
|
jenkins test windows |