Pause Single Cluster Upgrade work until stable.#4257

Merged
markmandel merged 3 commits into agones-dev:main from markmandel:remove/in-place-upgrades on Sep 2, 2025
@markmandel
Collaborator

What type of PR is this?

/kind cleanup

What this PR does / Why we need it:

For ~6 months the upgrade CI has been flaky/broken, making it unreliable and slowing community contribution and overall project momentum.

This change:

  • removes the build/push + submission steps for upgrade tests from cloudbuild.yaml to reduce noise and unblock CI reliability;
  • updates the upgrading guide to clearly state that in-place upgrades are on hiatus due to lack of reliable testing and were removed from CI, and recommends thorough testing (multi-cluster remains the recommended production strategy).

Which issue(s) this PR fixes:

N/A

Special notes for your reviewer:

We should re-enable the upgrade tests once someone can be dedicated to the workstream again and the tests can run reliably enough to provide signal.

With CI being this unstable we can't guarantee this functionality works at this stage anyway, so I don't think there's any reason to keep it running in CI.

@github-actions github-actions bot added the kind/cleanup (Refactoring code, fixing up documentation, etc) and size/S labels on Aug 26, 2025
Collaborator

@igooch igooch left a comment

It would make more sense to add a timeout and default to passing like something below, so that we retain the logs and dev can continue.

  # Run the upgrade tests in parallel; pass this step even if any of the tests fail
  - name: gcr.io/google.com/cloudsdktool/cloud-sdk
    id: submit-upgrade-test-cloud-build
    entrypoint: bash
    args:
      - -c
      - "./build/e2e_upgrade_test.sh ${_BASE_VERSION} ${PROJECT_ID} || true"
    waitFor:
      - wait-to-become-leader
      - push-upgrade-test
    timeout: 3600s # 1h

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: 5c82eddd-6534-4d3a-8a9a-206feedebecf

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4257/head:pr_4257 && git checkout pr_4257
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.52.0-dev-fb14abb

@markmandel
Collaborator Author

> It would make more sense to add a timeout and default to passing like something below, so that we retain the logs and dev can continue.

Don't hate this idea. Looking at logs, it usually takes 20m to pass, so we happy with a 30m timeout?

@igooch
Collaborator

igooch commented Aug 28, 2025

> > It would make more sense to add a timeout and default to passing like something below, so that we retain the logs and dev can continue.
>
> Don't hate this idea. Looking at logs, it usually takes 20m to pass, so we happy with a 30m timeout?

Yep, 30 min should work.

@markmandel
Collaborator Author

Note to self, this would actually need to be:

  # Run the upgrade tests in parallel; pass this step even if any of the tests fail
  - name: gcr.io/google.com/cloudsdktool/cloud-sdk
    id: submit-upgrade-test-cloud-build
    entrypoint: bash
    args:
      - -c
      - "timeout 30m ./build/e2e_upgrade_test.sh ${_BASE_VERSION} ${PROJECT_ID} || true"
    waitFor:
      - wait-to-become-leader
      - push-upgrade-test

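Otherwise the step-level timeout setting will fail the build when it expires; with the timeout command inside the shell, the failure is caught by || true.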
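
A minimal shell sketch of why the trailing || true matters with an in-shell timeout. It assumes GNU coreutils timeout, which exits with status 124 when the limit is hit. The sleep 2 command and the two variable names are stand-ins for illustration, not part of the Agones build scripts.

```shell
# Sketch only: `sleep 2` stands in for ./build/e2e_upgrade_test.sh.

# Without `|| true`: GNU timeout kills the command and reports status 124,
# which would make the step (and so the build) fail.
status_without_guard=0
timeout 1s sleep 2 || status_without_guard=$?

# With `|| true`: the non-zero status is swallowed, so the step exits 0.
status_with_guard=0
{ timeout 1s sleep 2 || true; } || status_with_guard=$?

echo "without guard: ${status_without_guard}"
echo "with guard:    ${status_with_guard}"
```

The same guard also hides ordinary test failures, which matches the intent stated in the step's comment: the logs are kept, but the step no longer blocks the build.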

@markmandel markmandel marked this pull request as ready for review September 1, 2025 20:53
@markmandel markmandel force-pushed the remove/in-place-upgrades branch from 3b746ed to e9a3a6a Compare September 1, 2025 20:54
@agones-bot
Collaborator

Build Failed 😭

Build Id: 95f2f818-7cd3-4f82-8fc6-53db02752357

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel markmandel force-pushed the remove/in-place-upgrades branch from e9a3a6a to fffda34 Compare September 1, 2025 21:57
@markmandel
Collaborator Author

Flakiness in counter scripts on autopilot cluster - pod went unhealthy?:

https://console.cloud.google.com/cloud-build/builds/699ba78b-4caf-4dfe-a2e5-4900bf3b99d9;step=2?project=agones-images

VERBOSE: time="2025-09-01 21:49:55.978" level=info msg="SetCounterCount Past Zero" fields.msg="SET_COUNTER_COUNT games -1"
VERBOSE: time="2025-09-01 21:49:56.133" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:50825->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:49:58.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:39884->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:00.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:59267->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:02.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:45410->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:04.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:49268->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:06.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:53658->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:08.135" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:59184->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:10.135" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:41287->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:12.135" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:33023->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:14.133" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:53879->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:16.133" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:42275->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:18.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:51739->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:20.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:55093->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:22.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:53925->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:24.135" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:36847->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:26.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:51650->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:28.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:55716->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:30.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:44328->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:32.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:58613->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:34.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:37723->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:36.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:52208->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:38.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:56396->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:40.133" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:58684->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:42.133" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:43051->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:44.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:50042->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:46.135" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:42834->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:48.135" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:34494->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:50.135" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:56977->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:52.135" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:33035->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:54.135" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:50706->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:56.134" level=info msg="Could not read from address" address="34.81.191.23:7505" error="read udp 192.168.10.2:37899->34.81.191.23:7505: read: connection refused" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:56.135" level=info msg="Failed to send UDP packet to GameServer. Dumping Events!" gs=game-serverj7x8c status="{State:Ready Ports:[{Name:udp-port Port:7505}] Address:34.81.191.23 Addresses:[{Type:InternalIP Address:10.140.0.114} {Type:ExternalIP Address:34.81.191.23} {Type:Hostname Address:gk3-gke-autopilot-e2e-te-nap-98oy711m-a4b23ea8-c2sn} {Type:PodIP Address:10.18.132.69}] NodeName:gk3-gke-autopilot-e2e-te-nap-98oy711m-a4b23ea8-c2sn ReservedUntil:<nil> Players:<nil> Counters:map[bar:{Count:10 Capacity:10} baz:{Count:1000 Capacity:1000} foo:{Count:10 Capacity:100} games:{Count:1 Capacity:50} qux:{Count:42 Capacity:50}] Lists:map[] Eviction:0xc0009612a0}" test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:56.135" level=info msg="Dumping Events:" kind= test=TestCounters/SetCounterCount_Past_Zero
VERBOSE: time="2025-09-01 21:50:56.646" level=info msg="Event!" lastTimestamp="2025-09-01 21:33:17 +0000 UTC" message="Port allocated" reason=PortAllocation test=TestCounters/SetCounterCount_Past_Zero type=Normal
VERBOSE: time="2025-09-01 21:50:56.646" level=info msg="Event!" lastTimestamp="2025-09-01 21:33:17 +0000 UTC" message="Pod game-serverj7x8c created" reason=Creating test=TestCounters/SetCounterCount_Past_Zero type=Normal
VERBOSE: time="2025-09-01 21:50:56.647" level=info msg="Event!" lastTimestamp="2025-09-01 21:33:42 +0000 UTC" message="SDK state change" reason=RequestReady test=TestCounters/SetCounterCount_Past_Zero type=Normal
VERBOSE: time="2025-09-01 21:50:56.647" level=info msg="Event!" lastTimestamp="2025-09-01 21:33:42 +0000 UTC" message="Address and port populated" reason=Ready test=TestCounters/SetCounterCount_Past_Zero type=Normal
VERBOSE: time="2025-09-01 21:50:56.647" level=info msg="Event!" lastTimestamp="2025-09-01 21:33:42 +0000 UTC" message="SDK.Ready() complete" reason=Ready test=TestCounters/SetCounterCount_Past_Zero type=Normal
VERBOSE: time="2025-09-01 21:50:56.647" level=info msg="Event!" lastTimestamp="2025-09-01 21:34:13 +0000 UTC" message="Issue with Gameserver pod" reason=Unhealthy test=TestCounters/SetCounterCount_Past_Zero type=Warning
VERBOSE:     gameserver_test.go:1575: 
VERBOSE:         	Error Trace:	/go/src/agones.dev/agones/test/e2e/gameserver_test.go:1575
VERBOSE:         	Error:      	Received unexpected error:
VERBOSE:         	            	context deadline exceeded
VERBOSE:         	            	timed out attempting to send UDP packet to address
VERBOSE:         	            	agones.dev/agones/test/e2e/framework.(*Framework).SendUDP
VERBOSE:         	            		/go/src/agones.dev/agones/test/e2e/framework/framework.go:601
VERBOSE:         	            	agones.dev/agones/test/e2e/framework.(*Framework).SendGameServerUDPToPort
VERBOSE:         	            		/go/src/agones.dev/agones/test/e2e/framework/framework.go:552
VERBOSE:         	            	agones.dev/agones/test/e2e/framework.(*Framework).SendGameServerUDP
VERBOSE:         	            		/go/src/agones.dev/agones/test/e2e/framework/framework.go:532
VERBOSE:         	            	agones.dev/agones/test/e2e.TestCounters.func1
VERBOSE:         	            		/go/src/agones.dev/agones/test/e2e/gameserver_test.go:1574
VERBOSE:         	            	testing.tRunner
VERBOSE:         	            		/usr/local/go/src/testing/testing.go:1792
VERBOSE:         	            	runtime.goexit
VERBOSE:         	            		/usr/local/go/src/runtime/asm_amd64.s:1700
VERBOSE:         	Test:       	TestCounters/SetCounterCount_Past_Zero
VERBOSE: --- FAIL: TestCounters/SetCounterCount_Past_Zero (60.67s)
VERBOSE: --- FAIL: TestCounters (1062.94s)
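
One way to check the "pod went unhealthy" guess is to pull only the Warning events out of a saved copy of the log. The sketch below is a hypothetical triage helper, not something from the Agones repo; the function name and the temporary sample file are assumptions. The sample is built from two event lines of the log above.

```shell
# Hypothetical helper: print only the Warning events from a saved e2e log.
warning_events() {
  grep 'msg="Event!"' "$1" | grep 'type=Warning'
}

# Small sample copied from two event lines of the log above.
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
VERBOSE: time="2025-09-01 21:50:56.647" level=info msg="Event!" lastTimestamp="2025-09-01 21:33:42 +0000 UTC" message="SDK.Ready() complete" reason=Ready test=TestCounters/SetCounterCount_Past_Zero type=Normal
VERBOSE: time="2025-09-01 21:50:56.647" level=info msg="Event!" lastTimestamp="2025-09-01 21:34:13 +0000 UTC" message="Issue with Gameserver pod" reason=Unhealthy test=TestCounters/SetCounterCount_Past_Zero type=Warning
EOF

warnings="$(warning_events "$sample_log")"
echo "$warnings"
rm -f "$sample_log"
```

On the log above, the only Warning event is the Unhealthy one with lastTimestamp 21:34:13, roughly 16 minutes before the first refused UDP read at 21:49:56.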

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: 038006cd-9162-41cb-b057-01dbc9ed6022

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4257/head:pr_4257 && git checkout pr_4257
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.52.0-dev-fffda34

@markmandel
Collaborator Author

Should be good to go 🤞🏻

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: d37de1a5-4d90-42a7-b02c-3ff8f632ade5

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4257/head:pr_4257 && git checkout pr_4257
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.52.0-dev-0e9c266

Collaborator

@lacroixthomas lacroixthomas left a comment

LGTM

@markmandel markmandel merged commit 8d82ce7 into agones-dev:main Sep 2, 2025
4 checks passed
@markmandel markmandel deleted the remove/in-place-upgrades branch September 2, 2025 15:08
Labels

kind/cleanup Refactoring code, fixing up documentation, etc size/S

4 participants