Fix: patch flaky tests from submit-upgrade-test-cloud-build#4236
Fix: patch flaky tests from submit-upgrade-test-cloud-build#4236igooch merged 4 commits intoagones-dev:mainfrom
Conversation
|
Build Succeeded 🥳 Build Id: 47cfb212-6fc5-41f8-b760-a1d0e3b00943 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version: |
|
/gcbrun |
|
Build Succeeded 🥳 Build Id: 28043ba6-150c-4dec-a2e7-078d803f04eb The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version: |
|
Build Succeeded 🥳 Build Id: 1a50399c-1717-4d34-8458-6a7edf7dd5db The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version: |
test/upgrade/main.go
Outdated
| cancel() | ||
| // TODO: Replace sleep with wait for the existing healthy Game Servers finish naturally by reaching their shutdown phase. | ||
| time.Sleep(30 * time.Second) | ||
|
|
There was a problem hiding this comment.
Whoops! That one fell through the cracks. Glad to the TODO was there to provide guideance.
LGTM, but would like @igooch to triple check as she was the one who originally wrote this.
|
Build Failed 😭 Build Id: bd56edee-9493-489d-b52a-df61519c557a Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
|
Well, that's new.. https://console.cloud.google.com/cloud-build/builds/5096077b-a0db-473a-97d0-87320db99856;step=0?inv=1&invt=Ab4lyw&project=agones-images |
|
/gcbrun |
|
Build Failed 😭 Build Id: f97e2dc3-136a-4588-8fe7-d8fea6e93977 Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
|
Seems that it's still flacky, it's not always the same error though |
Haven't seen this one before. It might be an issue with Helm version, or possibly with the way the variable is declared: https://github.com/search?q=repo%3Agoogleforgames%2Fagones%20HELM_URL&type=code. In any case these setups should all be the same, and they've diverged. |
|
Build Failed 😭 Build Id: 12efc4ea-908a-4a60-918a-1737916f3fd2 Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
The most recent failure is We have a log (times are UTC) for deleting a job that was still running: But no log for Which is odd. There's also no In general though it seems like the failed test runs are not cleaning up properly. For this PR I'd suggest that we wait for all non-unheathly game servers, including those in Shutdown state to finish before continuing. So probably doing the inverse, i.e. wait until there are no game servers or all known game servers are "Unhealthy". |
That looks like get.helm.sh was just not playing nice. So external dependency being flaky. Not an us problem. |
Is what I saw in the last log -- is that the issue? there was some issue deleting the namespace? |
|
/gcbrun |
|
Build Succeeded 🥳 Build Id: d09c5852-6f8a-4875-b839-04954609024f The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version: |
test/upgrade/main.go
Outdated
| // Count Game Servers that are still in healthy states (not shutdown/terminated) | ||
| healthyCount := 0 | ||
| for _, gs := range gameServers.Items { | ||
| state := gs.Status.State | ||
| // Consider Game Servers healthy if they're in Ready, Allocated, or Reserved states | ||
| if state == agonesv1.GameServerStateReady || | ||
| state == agonesv1.GameServerStateAllocated || | ||
| state == agonesv1.GameServerStateReserved { | ||
| healthyCount++ | ||
| } | ||
| } | ||
|
|
||
| log.Printf("Waiting for %d healthy Game Servers to shutdown", healthyCount) | ||
|
|
||
| // Return true when no healthy Game Servers remain | ||
| return healthyCount == 0, nil |
There was a problem hiding this comment.
| // Count Game Servers that are still in healthy states (not shutdown/terminated) | |
| healthyCount := 0 | |
| for _, gs := range gameServers.Items { | |
| state := gs.Status.State | |
| // Consider Game Servers healthy if they're in Ready, Allocated, or Reserved states | |
| if state == agonesv1.GameServerStateReady || | |
| state == agonesv1.GameServerStateAllocated || | |
| state == agonesv1.GameServerStateReserved { | |
| healthyCount++ | |
| } | |
| } | |
| log.Printf("Waiting for %d healthy Game Servers to shutdown", healthyCount) | |
| // Return true when no healthy Game Servers remain | |
| return healthyCount == 0, nil | |
| return len(gameServers) == 0, nil |
Is what we're saying?
Out of curiosity - why do we wait for GameServers to be totally deleted?
There was a problem hiding this comment.
Yea, not fully sure anymore
The call just after is cleanUpResources(), which already delete gs etc.: https://github.com/googleforgames/agones/blob/231024b332b723b6b60be3ed9057ef42715e85b5/test/upgrade/main.go#L564C1-L592C1
If we want we can do some check there to ensure everything is cleaned-up in between steps, to avoid the uninstall failing / timing out if there is still gs running ? 🤔
There was a problem hiding this comment.
It's a guess, but the issue could be that:
We delete the GS using kubectl, they are in shutdown etc. Them we delete the helm release without waiting that it's done, which start but never finish and fails on delete namespace as there is still resources on it ?
The issue being that we dont wait between the delete gs and the unisntall of the helm release, we might set a finalizers and it never be deleted as the helm uninstall doesn't wait that the resources are cleaned up, this might be the explanation about why we want to wait that they are either non ready or deleted
There was a problem hiding this comment.
Ill apply this changes on the clean resources fonction, to wait that the GS are cleaned up before uninstalling the helmrelease
There was a problem hiding this comment.
Actually, we could just do a helm uninstall agones -n agones-system --wait --timeout which should wait that all resources are deleted, on need to "wait" for the game servers to be terminated manually
There was a problem hiding this comment.
@igooch correct me if I'm wrong, but does that defeat the purpose of the test - which is to test in place upgrades.
But I've not dug that deep into what's going on in this test, so take what I say with a large grain of salt.
There was a problem hiding this comment.
From what I understood it's only after the test that runs the in place upgrades (runConfigWalker) which is called just before the cleanup: here, at the really end, once the test is done
But I might have missed something or miss-understood some part 😄
4d12197 to
376240f
Compare
|
Build Failed 😭 Build Id: e235f4bb-d7e1-4fec-8a1e-dcac0849ac2f Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
|
Build Succeeded 🥳 Build Id: bdbba812-6297-47b1-870e-2be6367530de The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version: |
|
Also possible I have no idea 😁 |
|
/gcbrun |
|
Build Failed 😭 Build Id: 7c27b5f6-077d-42f7-886c-90a5efdc99dd Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
|
Could also be there are multiple flakes in here. |
|
This error might gonna be fixed by this PR : #4224 |
|
/gcbrun |
|
Build Succeeded 🥳 Build Id: 6d155009-00dc-43b2-8499-de68fad42baf The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version: |
What type of PR is this?
/kind bug
What this PR does / Why we need it:
Fix flaky test from submit-upgrade-test-cloud-build
Ensure the gameserver are shuting down before uninstall helmrelease
Which issue(s) this PR fixes:
Closes #4235
Special notes for your reviewer: