MaxSurge and MinAvailable calculated incorrectly #3688
Description
What happened:
We see a very sharp decline in Ready GameServers when rolling over to a new GameServerSet. This leaves no Ready GameServers in the pool while the new version is stuck in Scheduled because the nodes are still downloading images.
What you expected to happen:
A smooth rollout with Ready GameServers always available.
How to reproduce it (as minimally and precisely as possible):
When we investigated this, it seems there is a difference between what the docs say here: https://agones.dev/site/docs/guides/fleet-updates/#rolling-update-strategy
By default, a Fleet will wait for new GameServers to become Ready during a Rolling Update before continuing to shutdown additional GameServers, only counting GameServers that are Ready as being available when calculating the current maxUnavailable value which controls the rate at which GameServers are updated.
and what we see in the calculation for MinAvailable:
https://github.com/googleforgames/agones/blob/main/pkg/fleets/controller.go#L555
It also stands to reason that MaxSurge should be calculated on the same basis as MinAvailable, and so should probably not be based on fleet.Spec.Replicas either.
With some real world numbers from one of our clusters, we have:
Replicas: 3.46K
ReadyReplicas: 194
AllocatedReplicas: 3.22K
MaxSurge: 10%
MinAvailable: 10%
With the current calculation it is allowed to terminate 346 GameServers in the first round, but we only have 194 Ready, so all Ready GameServers are terminated. Were this based on Ready GameServers, it would be allowed to terminate 19, which is a rather large difference.
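The arithmetic behind those two figures can be sketched as follows. This is only an illustration of the numbers in this report, not the controller's code: `percentOf` and `readyLost` are hypothetical helpers, "3.46K" is taken as 3460, and rounding down is an assumption chosen to match the figure of 19 above.

```go
package main

import "fmt"

// percentOf returns pct percent of base, rounded down.
// Hypothetical helper; rounding down is an assumption, not taken from Agones.
func percentOf(base, pct int) int {
	return base * pct / 100
}

// readyLost returns how many Ready GameServers could be terminated in one
// round, given a termination budget and the number currently Ready.
// Hypothetical helper: it assumes only Ready GameServers are shut down.
func readyLost(budget, ready int) int {
	if budget > ready {
		return ready
	}
	return budget
}

func main() {
	const (
		replicas = 3460 // "3.46K" from the report, assumed to be 3460
		ready    = 194
		pct      = 10
	)

	fromReplicas := percentOf(replicas, pct) // budget based on total Replicas
	fromReady := percentOf(ready, pct)       // budget based on Ready only

	fmt.Println("budget from Replicas:", fromReplicas, "Ready lost:", readyLost(fromReplicas, ready))
	fmt.Println("budget from Ready:   ", fromReady, "Ready lost:", readyLost(fromReady, ready))
}
```

With a budget derived from Replicas, the whole Ready pool of 194 fits inside the 346 allowed terminations. With a budget derived from Ready, only 19 would go in the first round.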
Anything else we need to know?:
Another thing that struck me as odd in the calculation is cleanupUnhealthyReplicas:
https://github.com/googleforgames/agones/blob/main/pkg/fleets/controller.go#L529
This seems to treat every GameServer that is not Ready as unhealthy, without accounting for Allocated ones, which strikes me as wrong.
It might also be a good idea to set Replicas to 0 on a non-active GameServerSet when Allocated == Replicas.
Environment:
- Agones version: 1.37.0
- Kubernetes version (use kubectl version): 1.27.3-gke.100
- Cloud provider or hardware configuration: GKE n2-custom-40-163840
- Install method (yaml/helm): helm
- Troubleshooting guide log(s):
- Others: