Fleet autoscaler threads maintain state #4277

Merged
markmandel merged 3 commits into agones-dev:main from markmandel:feature/autoscaler-state
Sep 17, 2025

@markmandel
Collaborator

What type of PR is this?

/kind feature

What this PR does / Why we need it:

This is necessary refactoring pre-work for #4080.

We need somewhere to store the wasm plugin once it is loaded for each fleet autoscaler CRD instance that is defined. This change provides that arbitrary state as part of `fasThread` (which means any new autoscaling ability can now have state! Someone want to do predictive math over time?). It also passes a `ctx` through, since anything with state will likely need to be closed at shutdown time (the wasm plugins do).
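
As a rough illustration of the idea above (not the code in this PR), here is a minimal sketch of a per-autoscaler thread record that carries arbitrary state and closes it when its context is cancelled. The names `thread`, `registry`, `setState`, and `fakePlugin` are hypothetical, and the assumption that stored state implements `io.Closer` is mine.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"sync"
)

// thread is a hypothetical stand-in for the per-FleetAutoscaler record.
type thread struct {
	generation int64
	cancel     context.CancelFunc
	state      map[string]any // arbitrary per-autoscaler state, e.g. a loaded plugin
}

// registry tracks one thread per FleetAutoscaler UID (hypothetical helper).
type registry struct {
	mu      sync.Mutex
	threads map[string]*thread
	wg      sync.WaitGroup
}

func newRegistry() *registry { return &registry{threads: map[string]*thread{}} }

// start registers a thread and spawns a goroutine that, once ctx is
// cancelled, closes every state value that implements io.Closer.
func (r *registry) start(parent context.Context, uid string, generation int64) *thread {
	ctx, cancel := context.WithCancel(parent)
	t := &thread{generation: generation, cancel: cancel, state: map[string]any{}}
	r.mu.Lock()
	r.threads[uid] = t
	r.mu.Unlock()

	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		<-ctx.Done()
		r.mu.Lock()
		defer r.mu.Unlock()
		for _, v := range t.state {
			if c, ok := v.(io.Closer); ok {
				_ = c.Close()
			}
		}
	}()
	return t
}

// setState stores a value on the thread under the registry lock.
func (r *registry) setState(t *thread, key string, value any) {
	r.mu.Lock()
	defer r.mu.Unlock()
	t.state[key] = value
}

// fakePlugin stands in for a loaded wasm plugin; it only records that Close ran.
type fakePlugin struct{ closed bool }

func (p *fakePlugin) Close() error { p.closed = true; return nil }

// runDemo stores a plugin, simulates shutdown, and reports whether it was closed.
func runDemo() bool {
	r := newRegistry()
	ctx, cancel := context.WithCancel(context.Background())
	t := r.start(ctx, "fas-uid-1", 1)
	p := &fakePlugin{}
	r.setState(t, "plugin", p)
	cancel() // simulate controller shutdown
	r.wg.Wait()
	return p.closed
}

func main() {
	fmt.Println("plugin closed on shutdown:", runDemo())
}
```

Tying cleanup to context cancellation means the same path handles both controller shutdown and removal of a single autoscaler, since cancelling either the parent or the per-thread context triggers the close.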

Which issue(s) this PR fixes:

Work on #4080

Special notes for your reviewer:

N/A

@github-actions github-actions bot added kind/feature New features for Agones size/S labels Sep 14, 2025
@agones-bot
Collaborator

Build Failed 😭

Build Id: 6f7f1a4f-eff8-49cf-bcd4-f8912d4041f7

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Collaborator Author

markmandel commented Sep 15, 2025

flake - autopilot

VERBOSE: time="2025-09-14 23:57:07.444" level=info msg="Event!" lastTimestamp="2025-09-14 23:57:03 +0000 UTC" message="Stopping container agones-gameserver-sidecar" reason=Killing test=TestSuperTuxKartGameServerReady type=Normal
VERBOSE:     examples_test.go:83: 
VERBOSE:         	Error Trace:	/go/src/agones.dev/agones/test/e2e/examples_test.go:83
VERBOSE:         	Error:      	Received unexpected error:
VERBOSE:         	            	waiting for {supertuxkart [{default  Dynamic <nil> 8080 0 UDP}] {false 60 0 30}  { 0 0} {{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []} {[] [] [{supertuxkart us-docker.pkg.dev/agones-images/examples/supertuxkart-example:0.19 [] []  [] [] [{ENABLE_PLAYER_TRACKING false nil}] {map[] map[] []} [] <nil> [] [] nil nil nil nil    nil false false false}] []  <nil> <nil>  map[]   <nil>  false false false <nil> nil []   nil  [] []  <nil> nil [] <nil> <nil> <nil> map[] [] <nil> nil <nil> [] [] nil}} <nil> map[] map[] <nil>} GameServer instance readiness timed out (): waiting for GameServer 1757894215/supertuxkart-cxbkq to be Ready: GameServer reached terminal state Unhealthy
VERBOSE:         	Test:       	TestSuperTuxKartGameServerReady
VERBOSE: --- FAIL: TestSuperTuxKartGameServerReady (10.99s)

@markmandel
Collaborator Author

/gcbrun

@agones-bot
Collaborator

Build Failed 😭

Build Id: 66082f3b-5a00-4136-84e4-ede7e6b06a90

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Collaborator Author

Step #19 - "tests": --- FAIL: TestControllerSyncFleetAutoscaler (0.00s)
Step #19 - "tests":     --- FAIL: TestControllerSyncFleetAutoscaler/wrong_policy (0.69s)
Step #19 - "tests":         controller_test.go:694: 
Step #19 - "tests":             	Error Trace:	/go/src/agones.dev/agones/pkg/fleetautoscalers/controller_test.go:694
Step #19 - "tests":             	Error:      	Not equal: 
Step #19 - "tests":             	            	expected: "error calculating autoscaling fleet: fleet-1: wrong policy type, should be one of: Buffer, Webhook, Counter, List, Schedule, Chain"
Step #19 - "tests":             	            	actual  : "There should be a fasThread for the FleetAutoscaler, but it was not found"
Step #19 - "tests":             	            	
Step #19 - "tests":             	            	Diff:
Step #19 - "tests":             	            	--- Expected
Step #19 - "tests":             	            	+++ Actual
Step #19 - "tests":             	            	@@ -1 +1 @@
Step #19 - "tests":             	            	-error calculating autoscaling fleet: fleet-1: wrong policy type, should be one of: Buffer, Webhook, Counter, List, Schedule, Chain
Step #19 - "tests":             	            	+There should be a fasThread for the FleetAutoscaler, but it was not found
Step #19 - "tests":             	Test:       	TestControllerSyncFleetAutoscaler/wrong_policy
Step #19 - "tests": {"message":"Wait for cache sync","severity":"info","time":"2025-09-15T01:08:09.745284288Z"}

I'll look into this, seems legit.

@markmandel
Collaborator Author

Ooh fun, I think this is a flake that's been here for quite a while, but I can definitely fix it.

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: 87d9b130-6249-4324-86ac-0896c0427970

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4277/head:pr_4277 && git checkout pr_4277
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.53.0-dev-f4bc9ac

  c.baseLogger = runtime.NewLoggerWithType(c)
  c.workerqueue = workerqueue.NewWorkerQueueWithRateLimiter(c.syncFleetAutoscaler, c.baseLogger, logfields.FleetAutoscalerKey, autoscaling.GroupName+".FleetAutoscalerController", workerqueue.FastRateLimiter(3*time.Second))
- health.AddLivenessCheck("fleetautoscaler-workerqueue", healthcheck.Check(c.workerqueue.Healthy))
+ health.AddLivenessCheck("fleetautoscaler-workerqueue", c.workerqueue.Healthy)
Collaborator Author

A small cleanup I slipped in
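
For context, a minimal sketch of why the explicit conversion can be dropped, assuming `healthcheck.Check` is a named `func() error` type: Go lets a plain `func() error` value be passed where such a named function type is expected. The `Check`, `handler`, `queue`, and `register` definitions below are local stand-ins for illustration, not the real library or Agones types.

```go
package main

import (
	"errors"
	"fmt"
)

// Check mirrors the assumed shape of a named health-check function type.
type Check func() error

// handler is a hypothetical registry of liveness checks.
type handler struct{ checks map[string]Check }

func (h *handler) AddLivenessCheck(name string, check Check) { h.checks[name] = check }

// queue is a hypothetical worker queue exposing a Healthy method.
type queue struct{ running bool }

func (q *queue) Healthy() error {
	if !q.running {
		return errors.New("queue not running")
	}
	return nil
}

// register adds the same method value twice: once with an explicit
// conversion and once without. Both compile and behave identically.
func register(q *queue) *handler {
	h := &handler{checks: map[string]Check{}}
	h.AddLivenessCheck("explicit", Check(q.Healthy)) // conversion is redundant
	h.AddLivenessCheck("implicit", q.Healthy)        // assignable as-is
	return h
}

func main() {
	h := register(&queue{running: true})
	fmt.Println(h.checks["explicit"]() == nil, h.checks["implicit"]() == nil)
}
```

The method value `q.Healthy` has the unnamed type `func() error`, which is assignable to `Check` because the two share an underlying type and only one of them is a named type.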

@agones-bot
Collaborator

Build Failed 😭

Build Id: 247efc9f-8667-4e79-8438-2fe39296940a

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@lacroixthomas
Collaborator

It seems the test was running without errors but hit the 10-minute timeout; not sure why 🤔

/gcbrun

@markmandel
Collaborator Author

=== FAIL: test/e2e/allocator  (0.00s)
time="2025-09-16 18:16:04.805" level=warning msg="Error creating inClusterConfig, trying to build config from flagsunable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined" error="unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined" source=framework
time="2025-09-16 18:16:04.817" level=info msg="Starting e2e test(s)" cloudProduct=gke-autopilot featureGates="AutopilotPassthroughPort=true&CountsAndLists=true&DisableResyncOnSDKServer=true&Example=false&FleetAutoscaleRequestMetaData=false&GKEAutopilotExtendedDurationPods=true&PlayerAllocationFilter=false&PlayerTracking=false&PortPolicyNone=true&PortRanges=true&ProcessorAllocator=false&RollingUpdateFix=true&ScheduledAutoscaler=true&SidecarContainers=false&WasmAutoscaler=false" gameServerImage="us-docker.pkg.dev/agones-images/examples/simple-game-server:0.39" namespace= perfOutputDir= pullSecret= stressTestLevel=0 version=
time="2025-09-16 18:16:05.042" level=info msg="Custom namespace is set: 1758046565"
time="2025-09-16 18:16:05.131" level=info msg="Namespace 1758046565 is created"
time="2025-09-16 18:16:05.226" level=info msg="ServiceAccount 1758046565/agones-sdk is created"
time="2025-09-16 18:16:05.373" level=info msg="RoleBinding 1758046565/agones-sdk-access is created"
panic: test timed out after 10m0s
	running tests:
		TestAllocatorAfterDeleteReplica (10m0s)

goroutine 122 [running]:
testing.(*M).startAlarm.func1()
	/usr/local/go/src/testing/testing.go:2484 +0x605
created by time.goFunc
	/usr/local/go/src/time/sleep.go:215 +0x45

goroutine 1 [chan receive, 10 minutes]:
testing.(*T).Run(0xc000297340, {0x2f7011f, 0x1f}, 0x304ca20)
	/usr/local/go/src/testing/testing.go:1859 +0x91e
testing.runTests.func1(0xc000297340)
	/usr/local/go/src/testing/testing.go:2279 +0x86
testing.tRunner(0xc000297340, 0xc0005b58c8)
	/usr/local/go/src/testing/testing.go:1792 +0x226
testing.runTests(0xc000506528, {0x4463860, 0x1, 0x1}, {0xc000080158?, 0x0?, 0x4498e40?})
	/usr/local/go/src/testing/testing.go:2277 +0x96d
testing.(*M).Run(0xc00050d040)
	/usr/local/go/src/testing/testing.go:2142 +0xeeb
agones.dev/agones/test/e2e/allocator.TestMain(0xc00050d040)
	/go/src/agones.dev/agones/test/e2e/allocator/main_test.go:91 +0xa85
main.main()
	_testmain.go:47 +0x172

(I often post flakes in here, so if you get one, you can search the repo to see how often it happens)

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: 2209431c-823c-4c67-bc32-1e953d17659d

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4277/head:pr_4277 && git checkout pr_4277
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.53.0-dev-1d5863a

Collaborator

@lacroixthomas lacroixthomas left a comment

LGTM

@markmandel markmandel merged commit 4e466c6 into agones-dev:main Sep 17, 2025
4 checks passed
@markmandel markmandel deleted the feature/autoscaler-state branch September 17, 2025 22:36
mnthe pushed a commit to mnthe/agones that referenced this pull request Mar 23, 2026
* Fleet autoscaler threads maintain state

* Fix for flaky unit test

Labels

kind/feature New features for Agones size/M size/S

3 participants