qa/suites/fs: stop looping in mds upgrade test if upgrade failed #45361
adk3798 merged 1 commit into ceph:master
1 failure due to a machine not being locked, an unrelated infra issue. Didn't hit the actual upgrade failure I was looking for. Need another run.
env: [sha1]
host.a:
- while ceph orch upgrade status | jq '.in_progress' | grep true ; do ceph orch ps ; ceph versions ; ceph fs dump; sleep 30 ; done
- while ceph orch upgrade status | jq '.in_progress' | grep true && ! ceph orch upgrade status | jq '.message' | grep Error ; do ceph orch ps ; ceph versions ; ceph fs dump; ceph orch upgrade status ; sleep 30 ; done
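The second `while` line above keeps polling only while the upgrade is still in progress and the status message does not contain `Error`. A minimal sketch of that condition, split into helpers for readability, could look like the following. The function names are hypothetical and not part of the PR; the `ceph` and `jq` invocations are the ones from the lines above, with `grep -q` added to silence output.

```shell
# Sketch only: function names are hypothetical, not part of the PR.

# Succeeds while the orchestrator reports the upgrade as still running.
upgrade_in_progress() {
    ceph orch upgrade status | jq '.in_progress' | grep -q true
}

# Succeeds if the upgrade status message mentions an error.
upgrade_errored() {
    ceph orch upgrade status | jq '.message' | grep -q Error
}

# Keep polling only while the upgrade is running AND no error is reported.
should_keep_waiting() {
    upgrade_in_progress && ! upgrade_errored
}

# Example use (not executed here):
# while should_keep_waiting; do ceph orch ps; ceph versions; ceph fs dump; sleep 30; done
```

With this shape the loop ends as soon as either the upgrade finishes or an error appears in the message, instead of spinning until the job is marked dead.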
Suggest also just doing an exit 1 after 30 minutes. No reason to let this run for 12h if something gets stuck.
Added a 30-minute timeout on the command.
EDIT: later removed it, as it was causing the test to be run with the old while statement for whatever reason. Will need to figure out a new way to do the timeout at some point.
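One way a 30-minute limit could be built into the loop itself, rather than wrapped around the command, is a deadline check on each iteration. This is only a sketch under assumptions, not what the PR ended up doing: `wait_with_deadline` and its arguments are hypothetical names, and it relies on `date +%s` being available.

```shell
# Sketch only: wait_with_deadline is a hypothetical helper, not part of the PR.
# Usage: wait_with_deadline <max_seconds> <interval_seconds> <condition command...>
# Returns 0 once the condition command stops succeeding,
# or 1 if the deadline passes while the condition still holds.
# Note: variables are global here to keep the sketch portable to plain sh.
wait_with_deadline() {
    wd_max=$1
    wd_interval=$2
    shift 2
    wd_start=$(date +%s)
    while "$@"; do
        wd_now=$(date +%s)
        if [ $((wd_now - wd_start)) -ge "$wd_max" ]; then
            echo "timed out after ${wd_max}s" >&2
            return 1
        fi
        sleep "$wd_interval"
    done
    return 0
}

# Example use (not executed here): poll every 30s, give up after 30 minutes.
# wait_with_deadline 1800 30 sh -c "ceph orch upgrade status | jq '.in_progress' | grep -q true"
```

A non-zero return lets the calling test step fail quickly instead of waiting for the job to be marked dead.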
cf6e990 to a21de1b
jenkins test make check
@vshankar I haven't been able to actually hit the upgrade issue while testing this. What do you think about just merging it in? Then, if you see the test fail again, you can ping me so I can take a look. It will have more info I can use for debugging and will at least fail in 30 minutes rather than 6 hours.
Absolutely. Let's merge this. I'll let you know how further tests look...
jenkins test make check
Just noticing in this run http://pulpito.front.sepia.ceph.com/adking-2022-03-15_17:49:30-orch:cephadm:mds_upgrade_sequence-wip-adk2-testing-2022-03-15-0949-distro-basic-smithi/ that, after the timeout was added, it seems to just be running the unchanged test, whereas pre-timeout it took the changes: http://pulpito.front.sepia.ceph.com/adking-2022-03-14_12:43:51-orch:cephadm:mds_upgrade_sequence-wip-adk2-testing-2022-03-11-1538-distro-basic-smithi/ It looks like with the timeout added to the front this doesn't actually work? It just ignored the changes entirely. Not sure how this works, but at least it doesn't seem to like using timeout like this.
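As a general note, `timeout` from GNU coreutils runs a program, while a `while` loop is shell syntax, so a loop usually has to be handed to a shell (for example via `sh -c`) before `timeout` can wrap it. The sketch below illustrates only that general point. Whether it has anything to do with the run above picking up the unchanged test is not known, and `run_with_timeout` is a hypothetical helper name.

```shell
# Sketch only: run_with_timeout is a hypothetical helper, not part of the PR.
# $1 is the time limit understood by coreutils timeout (e.g. 30m),
# $2 is the shell snippet to run under that limit.
# GNU timeout exits with status 124 when the limit is reached.
run_with_timeout() {
    timeout "$1" sh -c "$2"
}

# Example use (not executed here; assumes ceph and jq are on the host):
# run_with_timeout 30m "while ceph orch upgrade status | jq '.in_progress' | grep -q true; do sleep 30; done"
```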
@adk3798 Noticed that just now -- https://pulpito.ceph.com/vshankar-2022-03-16_09:42:54-fs:upgrade-wip-vshankar-testing-20220316-102808-testing-default-smithi/6739031/ I can't even see (in teuthology log) the
@adk3798 any progress on this?
Signed-off-by: Adam King <adking@redhat.com>
a21de1b to 37019aa
Failures:
jenkins test make check
@vshankar do we want to merge this?
I'm OK with this change if the runs are fine.
Signed-off-by: Adam King <adking@redhat.com>
Testing for https://tracker.ceph.com/issues/54419
Also possibly something we want to actually merge, depending on how the results go here. Having the test fail faster if the upgrade has failed for whatever reason would be a big improvement over running for 6 hours until the job is marked dead.