
qa/upgrade: fix checks to make sure upgrade is still in progress #58605

Merged: adk3798 merged 2 commits into ceph:main from adk3798:upgrade-suite-upgrade-in-progress-checks on Aug 9, 2024

adk3798 (Contributor) commented Jul 15, 2024

Unless the test checks both that the upgrade is still in progress and
that the status isn't reporting an error, we can end up in a scenario
where the test is just waiting for an upgrade that has already
been marked failed and will never complete. The same sort of
change was already made in the orch suite upgrade tests and
has helped with jobs timing out there.

Fixes: https://tracker.ceph.com/issues/65546

This also updates the reef-x stress-split test to use the staggered
upgrade parameters, since we can be sure any given reef image
supports them.
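
As a rough illustration of the wait condition described above (not the suite's actual code), the check can be sketched in Python. The `in_progress` and `message` field names are assumed to match the JSON printed by `ceph orch upgrade status`, and `fetch_status` is a hypothetical callable supplied by the caller.

```python
import time
from typing import Callable, Mapping


def upgrade_still_running(status: Mapping[str, object]) -> bool:
    """Return True only while the upgrade is in progress AND has not reported an error.

    Assumption: `status` is the parsed JSON status with an `in_progress`
    boolean and a free-text `message` field.
    """
    in_progress = bool(status.get("in_progress", False))
    message = str(status.get("message") or "")
    return in_progress and "Error" not in message


def wait_for_upgrade(
    fetch_status: Callable[[], Mapping[str, object]],
    interval: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, object]:
    """Poll until the upgrade finishes, fails, or max_polls is reached.

    Returns the last status seen so the caller can tell success from failure.
    """
    status = fetch_status()
    polls = 1
    while upgrade_still_running(status) and polls < max_polls:
        sleep(interval)
        status = fetch_status()
        polls += 1
    return status
```

Checking only `in_progress` would keep this loop spinning if a failed upgrade were still flagged as in progress; the extra `message` check lets the loop exit so the job can fail quickly instead of timing out.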


adk3798 added 2 commits July 15, 2024 15:02
Without checking both for the upgrade being in progress and that
the status isn't reporting an error, we can end up in a scenario
where the test is just waiting for an upgrade that has already
been marked failed and will never complete. This same sort of
change was already done in the orch suite upgrade tests and
has helped with jobs timing out there

Fixes: https://tracker.ceph.com/issues/65546

Signed-off-by: Adam King <adking@redhat.com>

This test was trying to partially upgrade the mons and OSDs by
kicking off an upgrade and then checking every 2 seconds if
enough had been upgraded. Since staggered upgrade parameters
were present in the initial reef release (not true for quincy)
it makes sense to use them instead in order to do this in a
more controlled manner.

Signed-off-by: Adam King <adking@redhat.com>
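
For readers unfamiliar with staggered upgrades, here is a minimal sketch of how such a command line might be assembled. It assumes the `--daemon-types`, `--hosts`, and `--limit` options of `ceph orch upgrade start` as described in the cephadm staggered-upgrade documentation; the helper name and the image string in the usage note are hypothetical and are not taken from this PR.

```python
from typing import List, Optional, Sequence


def staggered_upgrade_cmd(
    image: str,
    daemon_types: Optional[Sequence[str]] = None,
    hosts: Optional[Sequence[str]] = None,
    limit: Optional[int] = None,
) -> List[str]:
    """Build an argv list for a staggered `ceph orch upgrade start`.

    Assumption: the flag names below match the cephadm CLI; verify them
    against the release under test before relying on this.
    """
    cmd = ["ceph", "orch", "upgrade", "start", "--image", image]
    if daemon_types:
        # Restrict the upgrade to the listed daemon types, e.g. mon or osd.
        cmd += ["--daemon-types", ",".join(daemon_types)]
    if hosts:
        # Restrict the upgrade to daemons on the listed hosts.
        cmd += ["--hosts", ",".join(hosts)]
    if limit is not None:
        if limit < 1:
            raise ValueError("limit must be a positive integer")
        # Cap how many daemons this upgrade run will touch.
        cmd += ["--limit", str(limit)]
    return cmd
```

For example, `staggered_upgrade_cmd("example.invalid/ceph:target", daemon_types=["osd"], limit=2)` would request an upgrade of at most two OSDs, which avoids having to start a full upgrade and poll every 2 seconds to see whether enough daemons have been upgraded.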

adk3798 (Contributor Author) commented Jul 15, 2024

@batrick this should help with some of the upgrade test timeouts you brought up in the CLT call

batrick (Member) left a comment

Thank you for taking this up @adk3798! I assume you're running this through the upgrade suites?

adk3798 (Contributor Author) commented Jul 15, 2024

> Thank you for taking this up @adk3798! I assume you're running this through the upgrade suites?

yeah, I'm just going to include it in a build with a bunch of PRs for an orch run and then also run the upgrade suite

batrick (Member) commented Jul 26, 2024

jenkins test dashboard cephadm

batrick (Member) commented Jul 26, 2024

Adding this to my batch too for fun. Don't wait on me.

batrick (Member) commented Jul 26, 2024

This PR is under test in https://tracker.ceph.com/issues/67214.

batrick (Member) commented Jul 27, 2024

jenkins test dashboard cephadm

nizamial09 (Member)

dashboard cephadm e2e started breaking recently after #56331.
We have a tracker and I'll be circling back to fix this later. https://tracker.ceph.com/issues/66491

batrick (Member) commented Aug 6, 2024

jenkins test dashboard cephadm

batrick (Member) commented Aug 8, 2024

Sigh. upgrade suite is a mess right now so it's hard to evaluate this PR.

https://pulpito.ceph.com/?suite=upgrade

@adk3798 if you're satisfied this hasn't obviously broken anything and will fix the tracker ticket, you have my blessing to merge.

adk3798 (Contributor Author) commented Aug 9, 2024

> Sigh. upgrade suite is a mess right now so it's hard to evaluate this PR.
>
> https://pulpito.ceph.com/?suite=upgrade
>
> @adk3798 if you're satisfied this hasn't obviously broken anything and will fix the tracker ticket, you have my blessing to merge.

Alright, I'm pretty sure the main issues with the suite are the thrashosds task timing out (that's why all the stress-split jobs still die) and missing ignorelist entries. This PR should still be good for what it was intended to fix.

adk3798 merged commit 528a1eb into ceph:main on Aug 9, 2024

adk3798 (Contributor Author) commented Aug 9, 2024

> Sigh. upgrade suite is a mess right now so it's hard to evaluate this PR.
> https://pulpito.ceph.com/?suite=upgrade
> @adk3798 if you're satisfied this hasn't obviously broken anything and will fix the tracker ticket, you have my blessing to merge.
>
> Alright, I'm pretty sure the main issues with the suite are the thrashosds task timing out (that's why all the stress-split jobs still die) and missing ignorelist entries. This PR should still be good for what it was intended to fix.

it's possible the thrashosds task issue is https://tracker.ceph.com/issues/66698, but I haven't been able to confirm yet. Going to rerun one of the stress-split jobs

3 participants