Skip to content

Fix: patch flaky tests from submit-upgrade-test-cloud-build#4236

Merged
igooch merged 4 commits intoagones-dev:mainfrom
lacroixthomas:bugfixes/fix-flacky-test-25
Aug 12, 2025
Merged

Fix: patch flaky tests from submit-upgrade-test-cloud-build#4236
igooch merged 4 commits intoagones-dev:mainfrom
lacroixthomas:bugfixes/fix-flacky-test-25

Conversation

@lacroixthomas
Copy link
Copy Markdown
Collaborator

What type of PR is this?
/kind bug

What this PR does / Why we need it:
Fix flaky test from submit-upgrade-test-cloud-build

Ensure the gameserver are shuting down before uninstall helmrelease

Which issue(s) this PR fixes:

Closes #4235

Special notes for your reviewer:

@github-actions github-actions bot added kind/bug These are bugs. size/S labels Jul 31, 2025
@markmandel markmandel requested a review from igooch July 31, 2025 22:07
@agones-bot
Copy link
Copy Markdown
Collaborator

Build Succeeded 🥳

Build Id: 47cfb212-6fc5-41f8-b760-a1d0e3b00943

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4236/head:pr_4236 && git checkout pr_4236
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.51.0-dev-de2296c

@lacroixthomas
Copy link
Copy Markdown
Collaborator Author

/gcbrun

@agones-bot
Copy link
Copy Markdown
Collaborator

Build Succeeded 🥳

Build Id: 28043ba6-150c-4dec-a2e7-078d803f04eb

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4236/head:pr_4236 && git checkout pr_4236
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.51.0-dev-de2296c

@agones-bot
Copy link
Copy Markdown
Collaborator

Build Succeeded 🥳

Build Id: 1a50399c-1717-4d34-8458-6a7edf7dd5db

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4236/head:pr_4236 && git checkout pr_4236
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.52.0-dev-d40fbb7

cancel()
// TODO: Replace sleep with wait for the existing healthy Game Servers finish naturally by reaching their shutdown phase.
time.Sleep(30 * time.Second)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops! That one fell through the cracks. Glad to the TODO was there to provide guideance.

LGTM, but would like @igooch to triple check as she was the one who originally wrote this.

@agones-bot
Copy link
Copy Markdown
Collaborator

Build Failed 😭

Build Id: bd56edee-9493-489d-b52a-df61519c557a

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@lacroixthomas
Copy link
Copy Markdown
Collaborator Author

Well, that's new.. https://console.cloud.google.com/cloud-build/builds/5096077b-a0db-473a-97d0-87320db99856;step=0?inv=1&invt=Ab4lyw&project=agones-images

ERROR: failed to solve: executor failed running [/bin/sh -c curl -L  ${HELM_URL} > /tmp/helm.tar.gz     && tar -zxvf /tmp/helm.tar.gz -C /tmp     && mv /tmp/linux-amd64/helm /usr/local/bin/helm     && chmod go+rx /usr/local/bin/helm     && rm /tmp/helm.tar.gz && rm -rf /tmp/linux-amd64]: exit code: 7
------
0.402 curl: (7) Failed to connect to get.helm.sh port 443 after 23 ms: Connection refused
0.386 
0.382                                  Dload  Upload   Total   Spent    Left  Speed
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
 > [ 6/11] RUN curl -L  https://get.helm.sh/helm-v3.5.0-linux-amd64.tar.gz > /tmp/helm.tar.gz     && tar -zxvf /tmp/helm.tar.gz -C /tmp     && mv /tmp/linux-amd64/helm /usr/local/bin/helm     && chmod go+rx /usr/local/bin/helm     && rm /tmp/helm.tar.gz && rm -rf /tmp/linux-amd64:
------
#10 ERROR: executor failed running [/bin/sh -c curl -L  ${HELM_URL} > /tmp/helm.tar.gz     && tar -zxvf /tmp/helm.tar.gz -C /tmp     && mv /tmp/linux-amd64/helm /usr/local/bin/helm     && chmod go+rx /usr/local/bin/helm     && rm /tmp/helm.tar.gz && rm -rf /tmp/linux-amd64]: exit code: 7
#10 0.402 curl: (7) Failed to connect to get.helm.sh port 443 after 23 ms: Connection refused

@lacroixthomas
Copy link
Copy Markdown
Collaborator Author

/gcbrun

@agones-bot
Copy link
Copy Markdown
Collaborator

Build Failed 😭

Build Id: f97e2dc3-136a-4588-8fe7-d8fea6e93977

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@lacroixthomas
Copy link
Copy Markdown
Collaborator Author

lacroixthomas commented Aug 4, 2025

Seems that it's still flacky, it's not always the same error though

@igooch
Copy link
Copy Markdown
Collaborator

igooch commented Aug 4, 2025

Well, that's new.. https://console.cloud.google.com/cloud-build/builds/5096077b-a0db-473a-97d0-87320db99856;step=0?inv=1&invt=Ab4lyw&project=agones-images

ERROR: failed to solve: executor failed running [/bin/sh -c curl -L  ${HELM_URL} > /tmp/helm.tar.gz     && tar -zxvf /tmp/helm.tar.gz -C /tmp     && mv /tmp/linux-amd64/helm /usr/local/bin/helm     && chmod go+rx /usr/local/bin/helm     && rm /tmp/helm.tar.gz && rm -rf /tmp/linux-amd64]: exit code: 7
------
0.402 curl: (7) Failed to connect to get.helm.sh port 443 after 23 ms: Connection refused
0.386 
0.382                                  Dload  Upload   Total   Spent    Left  Speed
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
 > [ 6/11] RUN curl -L  https://get.helm.sh/helm-v3.5.0-linux-amd64.tar.gz > /tmp/helm.tar.gz     && tar -zxvf /tmp/helm.tar.gz -C /tmp     && mv /tmp/linux-amd64/helm /usr/local/bin/helm     && chmod go+rx /usr/local/bin/helm     && rm /tmp/helm.tar.gz && rm -rf /tmp/linux-amd64:
------
#10 ERROR: executor failed running [/bin/sh -c curl -L  ${HELM_URL} > /tmp/helm.tar.gz     && tar -zxvf /tmp/helm.tar.gz -C /tmp     && mv /tmp/linux-amd64/helm /usr/local/bin/helm     && chmod go+rx /usr/local/bin/helm     && rm /tmp/helm.tar.gz && rm -rf /tmp/linux-amd64]: exit code: 7
#10 0.402 curl: (7) Failed to connect to get.helm.sh port 443 after 23 ms: Connection refused

Haven't seen this one before. It might be an issue with Helm version, or possibly with the way the variable is declared: https://github.com/search?q=repo%3Agoogleforgames%2Fagones%20HELM_URL&type=code. In any case these setups should all be the same, and they've diverged.

@agones-bot
Copy link
Copy Markdown
Collaborator

Build Failed 😭

Build Id: 12efc4ea-908a-4a60-918a-1737916f3fd2

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@igooch
Copy link
Copy Markdown
Collaborator

igooch commented Aug 4, 2025

Build Failed 😭

Build Id: f97e2dc3-136a-4588-8fe7-d8fea6e93977

Status: FAILURE

The most recent failure is Error from server (NotFound): gameservers.agones.dev "sdk-client-test-hqrr5" not found.

We have a log (times are UTC) for deleting a job that was still running:

INFO 2025-08-04T16:25:04.707185589Z Step #25 - "submit-upgrade-test-cloud-build": Deleting job from previous run of upgrade-test-runner on cluster standard-upgrade-test-cluster-1-31.

from https://github.com/googleforgames/agones/blob/231024b332b723b6b60be3ed9057ef42715e85b5/build/e2e_upgrade_test.sh#L94

But no log for Deleting game servers from previous run of upgrade-test-runner on cluster from https://github.com/googleforgames/agones/blob/231024b332b723b6b60be3ed9057ef42715e85b5/build/e2e_upgrade_test.sh#L104

Which is odd. There's also no Running command delete gs -l app=sdk-client-test coming from within the test itself, which would be expected if the test itself was trying to delete the game server.
https://github.com/googleforgames/agones/blob/231024b332b723b6b60be3ed9057ef42715e85b5/test/upgrade/main.go#L240

In general though it seems like the failed test runs are not cleaning up properly.

For this PR I'd suggest that we wait for all non-unheathly game servers, including those in Shutdown state to finish before continuing. So probably doing the inverse, i.e. wait until there are no game servers or all known game servers are "Unhealthy".

@markmandel
Copy link
Copy Markdown
Collaborator

0.402 curl: (7) Failed to connect to get.helm.sh port 443 after 23 ms: Connection refused

That looks like get.helm.sh was just not playing nice. So external dependency being flaky. Not an us problem.

@markmandel
Copy link
Copy Markdown
Collaborator

Deleting agones-system namespace from previous run of upgrade-test-runner on cluster standard-upgrade-test-cluster-1-31.
namespace "agones-system" deleted
Error from server (InternalError): an error on the server ("Internal Server Error: \"/api/v1/namespaces/agones-system\": the server is currently unable to handle the request") has prevented the request from succeeding (get namespaces agones-system)

Is what I saw in the last log -- is that the issue? there was some issue deleting the namespace?

@markmandel
Copy link
Copy Markdown
Collaborator

/gcbrun

@agones-bot
Copy link
Copy Markdown
Collaborator

Build Succeeded 🥳

Build Id: d09c5852-6f8a-4875-b839-04954609024f

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4236/head:pr_4236 && git checkout pr_4236
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.52.0-dev-4d12197

Comment on lines +357 to +372
// Count Game Servers that are still in healthy states (not shutdown/terminated)
healthyCount := 0
for _, gs := range gameServers.Items {
state := gs.Status.State
// Consider Game Servers healthy if they're in Ready, Allocated, or Reserved states
if state == agonesv1.GameServerStateReady ||
state == agonesv1.GameServerStateAllocated ||
state == agonesv1.GameServerStateReserved {
healthyCount++
}
}

log.Printf("Waiting for %d healthy Game Servers to shutdown", healthyCount)

// Return true when no healthy Game Servers remain
return healthyCount == 0, nil
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Count Game Servers that are still in healthy states (not shutdown/terminated)
healthyCount := 0
for _, gs := range gameServers.Items {
state := gs.Status.State
// Consider Game Servers healthy if they're in Ready, Allocated, or Reserved states
if state == agonesv1.GameServerStateReady ||
state == agonesv1.GameServerStateAllocated ||
state == agonesv1.GameServerStateReserved {
healthyCount++
}
}
log.Printf("Waiting for %d healthy Game Servers to shutdown", healthyCount)
// Return true when no healthy Game Servers remain
return healthyCount == 0, nil
return len(gameServers) == 0, nil

Is what we're saying?

Out of curiosity - why do we wait for GameServers to be totally deleted?

Copy link
Copy Markdown
Collaborator Author

@lacroixthomas lacroixthomas Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, not fully sure anymore
The call just after is cleanUpResources(), which already delete gs etc.: https://github.com/googleforgames/agones/blob/231024b332b723b6b60be3ed9057ef42715e85b5/test/upgrade/main.go#L564C1-L592C1

If we want we can do some check there to ensure everything is cleaned-up in between steps, to avoid the uninstall failing / timing out if there is still gs running ? 🤔

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a guess, but the issue could be that:

We delete the GS using kubectl, they are in shutdown etc. Them we delete the helm release without waiting that it's done, which start but never finish and fails on delete namespace as there is still resources on it ?

The issue being that we dont wait between the delete gs and the unisntall of the helm release, we might set a finalizers and it never be deleted as the helm uninstall doesn't wait that the resources are cleaned up, this might be the explanation about why we want to wait that they are either non ready or deleted

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ill apply this changes on the clean resources fonction, to wait that the GS are cleaned up before uninstalling the helmrelease

Copy link
Copy Markdown
Collaborator Author

@lacroixthomas lacroixthomas Aug 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we could just do a helm uninstall agones -n agones-system --wait --timeout which should wait that all resources are deleted, on need to "wait" for the game servers to be terminated manually

Copy link
Copy Markdown
Collaborator

@markmandel markmandel Aug 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@igooch correct me if I'm wrong, but does that defeat the purpose of the test - which is to test in place upgrades.

But I've not dug that deep into what's going on in this test, so take what I say with a large grain of salt.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understood it's only after the test that runs the in place upgrades (runConfigWalker) which is called just before the cleanup: here, at the really end, once the test is done

But I might have missed something or miss-understood some part 😄

@lacroixthomas lacroixthomas force-pushed the bugfixes/fix-flacky-test-25 branch from 4d12197 to 376240f Compare August 9, 2025 00:50
@agones-bot
Copy link
Copy Markdown
Collaborator

Build Failed 😭

Build Id: e235f4bb-d7e1-4fec-8a1e-dcac0849ac2f

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Copy Markdown
Collaborator

Build Succeeded 🥳

Build Id: bdbba812-6297-47b1-870e-2be6367530de

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4236/head:pr_4236 && git checkout pr_4236
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.52.0-dev-c9bdecc

@markmandel
Copy link
Copy Markdown
Collaborator

Also possible I have no idea 😁

@markmandel
Copy link
Copy Markdown
Collaborator

/gcbrun

@agones-bot
Copy link
Copy Markdown
Collaborator

Build Failed 😭

Build Id: 7c27b5f6-077d-42f7-886c-90a5efdc99dd

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Copy link
Copy Markdown
Collaborator

Could also be there are multiple flakes in here.

@lacroixthomas
Copy link
Copy Markdown
Collaborator Author

This error might gonna be fixed by this PR : #4224

@lacroixthomas
Copy link
Copy Markdown
Collaborator Author

/gcbrun

@agones-bot
Copy link
Copy Markdown
Collaborator

Build Succeeded 🥳

Build Id: 6d155009-00dc-43b2-8499-de68fad42baf

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4236/head:pr_4236 && git checkout pr_4236
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.52.0-dev-c9bdecc

Copy link
Copy Markdown
Collaborator

@igooch igooch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@igooch igooch merged commit 9e74311 into agones-dev:main Aug 12, 2025
4 checks passed
@lacroixthomas lacroixthomas mentioned this pull request Aug 13, 2025
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Flacky pipeline]: Flaky test from submit-upgrade-test-cloud-build

4 participants