MaxSurge and MinAvailable calculated incorrectly #3688

@nrwiersma

What happened:

We see a very sharp decline in Ready GameServers when rolling over to a new GameServerSet. This leaves no Ready GameServers in the pool while the new version is stuck in Scheduled because the nodes are still downloading images.

What you expected to happen:

A smooth rollout with Ready GameServers always available.

How to reproduce it (as minimally and precisely as possible):

When we investigated this, we found a discrepancy between what the docs say here: https://agones.dev/site/docs/guides/fleet-updates/#rolling-update-strategy

By default, a Fleet will wait for new GameServers to become Ready during a Rolling Update before continuing to shut down additional GameServers, only counting GameServers that are Ready as being available when calculating the current maxUnavailable value which controls the rate at which GameServers are updated.

and what we see in the calculation for MinAvailable:

https://github.com/googleforgames/agones/blob/main/pkg/fleets/controller.go#L555

It also stands to reason that MaxSurge is calculated the same way as MinAvailable, and it should probably not be based on fleet.Spec.Replicas either.

With some real world numbers from one of our clusters, we have:

Replicas: 3.46K
ReadyReplicas: 194
AllocatedReplicas: 3.22K
MaxSurge: 10%
MinAvailable: 10%

With the current calculation, the rollout is allowed to terminate 346 GameServers (10% of Replicas) in the first round, but we only have 194 Ready, so all Ready GameServers are terminated. Were this based on Ready GameServers, it would be allowed to terminate 19 (10% of 194), which is a rather large difference.
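To make the arithmetic concrete, here is a minimal Go sketch of the two ways of resolving the 10% value. It is not the Agones controller code: `scaledPercent` is a hypothetical helper, the rounding is assumed to be downward, and "3.46K" is assumed to mean exactly 3460.

```go
package main

import "fmt"

// scaledPercent returns percent% of total, rounded down.
// Hypothetical helper for illustration only; not the Agones implementation.
func scaledPercent(total, percent int) int {
	return total * percent / 100
}

func main() {
	const (
		replicas      = 3460 // "3.46K" above, assumed to be exactly 3460
		readyReplicas = 194
		percent       = 10
	)

	// Budget when the percentage is resolved against the total replica count.
	fromReplicas := scaledPercent(replicas, percent)
	// Budget when the percentage is resolved against Ready GameServers only.
	fromReady := scaledPercent(readyReplicas, percent)

	fmt.Println("budget based on Replicas:", fromReplicas)
	fmt.Println("budget based on ReadyReplicas:", fromReady)
	// True when the first-round budget can consume the whole Ready pool.
	fmt.Println("budget covers entire Ready pool:", fromReplicas >= readyReplicas)
}
```

Under these assumptions the Replicas-based budget (346) is larger than the whole Ready pool (194), while the Ready-based budget (19) would leave most Ready GameServers in place.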

Anything else we need to know?:

Another thing that struck me as odd in the calculation is cleanupUnhealthyReplicas:

https://github.com/googleforgames/agones/blob/main/pkg/fleets/controller.go#L529

This appears to treat everything that is not Ready as unhealthy and does not account for Allocated, which strikes me as wrong.

It might also be a good idea to set Replicas to 0 on a non-active GameServerSet when Allocated == Replicas.

Environment:

  • Agones version: 1.37.0
  • Kubernetes version (use kubectl version): 1.27.3-gke.100
  • Cloud provider or hardware configuration: GKE n2-custom-40-163840
  • Install method (yaml/helm): helm
  • Troubleshooting guide log(s):
  • Others:
