
RollingUpdate maxUnavailable should ensure to allow new allocations #3438

@Ullaakut

Description


What happened:

As currently implemented, the maxUnavailable parameter of the RollingUpdate strategy is counterintuitive and can lead to situations where:

  • new allocations are made on outdated game server sets
  • ready replicas in outdated game server sets are not reaped

What you expected to happen:

  1. maxUnavailable should ensure that new allocations can always be served. The number of already allocated game servers should not influence this calculation, since they are not relevant to whether new allocations can be made.
  2. New allocations should be prioritized on active game server sets over outdated ones.
  3. Outdated game server sets should allow their ready replicas to be reaped to let the active game server set reach its desired replica count and finish the rolling update.

How to reproduce it (as minimally and precisely as possible):

Install Agones

  1. kubectl create namespace agones-system
  2. kubectl apply --server-side -f https://raw.githubusercontent.com/googleforgames/agones/release-1.35.0/install/yaml/install.yaml

Create a Fleet

  1. kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/release-1.35.0/examples/simple-game-server/fleet.yaml
  2. Verify that the Fleet's maxSurge and maxUnavailable are both set to 25% by running kubectl get fleets.agones.dev/simple-game-server -o json | jq ".spec.strategy"
    {
      "rollingUpdate": {
        "maxSurge": "25%",
        "maxUnavailable": "25%"
      },
      "type": "RollingUpdate"
    }
  3. Set the number of replicas to 10: kubectl edit fleets.agones.dev/simple-game-server
  4. Cause 3 allocations by running the following command three times: kubectl create -f https://raw.githubusercontent.com/googleforgames/agones/release-1.35.0/examples/simple-game-server/gameserverallocation.yaml -o yaml

Trigger a RollingUpdate

  1. Run watch kubectl get gameserversets and confirm that the fleet reports 3 allocated and 7 ready game servers:
    kubectl get gameserversets
    NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
    simple-game-server-1   Packed       10        10        3           7       1m
  2. Run kubectl edit fleets.agones.dev/simple-game-server again, to change the game server version to 0.17 instead of 0.18.
  3. Trigger 3 more allocations by running kubectl create -f https://raw.githubusercontent.com/googleforgames/agones/release-1.35.0/examples/simple-game-server/gameserverallocation.yaml -o yaml three more times. The game server sets should look something like this:
    NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
    simple-game-server-1   Packed       0         3         3           0       2m
    simple-game-server-2   Packed       7         7         3           4       1m
  4. Run kubectl edit fleets.agones.dev/simple-game-server one last time, to change the game server version to 0.16 instead of 0.17 to trigger one more rolling update.
  5. The game server sets now demonstrate the issue: simple-game-server-2 is left with a hanging ready replica that never gets reaped, even though reaping it would bring the fleet back to its intended replica count and finish the update.
    NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
    simple-game-server-1   Packed       0         3         3           0       3m
    simple-game-server-2   Packed       4         4         3           1       2m
    simple-game-server-3   Packed       4         4         0           4       1m

Fix the broken RollingUpdate by updating the maxUnavailable value

Edit the update strategy to allow the game server set to scale down.

  1. kubectl edit fleets.agones.dev/simple-game-server
  2. Change the value of maxUnavailable to 80%
  3. Observe that the rolling update now completes successfully: the old game server sets see their DESIRED replica count reach zero, and all of their READY replicas get reaped.
    NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
    simple-game-server-1   Packed       0         3         3           0       4m
    simple-game-server-2   Packed       0         3         3           0       3m
    simple-game-server-3   Packed       4         4         0           4       2m
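As an aside, the same strategy change from step 2 can be made non-interactively with a JSON merge patch instead of kubectl edit. This is a sketch: it assumes the Fleet from the repro steps above and skips the call when kubectl is not on the PATH.

```shell
# Apply the maxUnavailable change as a JSON merge patch (non-interactive
# alternative to `kubectl edit`); requires a cluster with the Fleet installed.
PATCH='{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":"80%"}}}}'
if command -v kubectl >/dev/null 2>&1; then
  kubectl patch fleets.agones.dev/simple-game-server --type merge -p "$PATCH"
fi
```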

Anything else we need to know?:

In the following code snippet, fleet refers to the Fleet specification we have been editing, while active refers to the GameServerSet currently being updated and rest refers to the other GameServerSets that are not being updated.

Let's consider the case where there are the following three game server sets:

NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-game-server-1   Packed       0         3         3           0       3m
simple-game-server-2   Packed       4         4         3           1       2m
simple-game-server-3   Packed       4         4         0           4       1m

When rollingUpdateRestFixedOnReady is called with simple-game-server-3 as the active game server set and simple-game-server-1 and simple-game-server-2 as the rest, the following happens:

  1. r, err := intstr.GetValueFromIntOrPercent(fleet.Spec.Strategy.RollingUpdate.MaxUnavailable, int(fleet.Spec.Replicas), false) returns 0.25 * 10 rounded down, which is 2. This is then stored in the unavailable variable as an int32.
  2. readyReplicasCount is 5, since there are a total of five ready replicas across the game server sets.
  3. minAvailable is 8 (fleet.Spec.Replicas - unavailable, i.e. 10 - 2).
  4. cleanupUnhealthyReplicas is then called, which does nothing since the ready replicas on outdated game server sets aren't considered unhealthy.
  5. Then, the statement if readyReplicasCount <= minAvailable { return nil } causes the method to return without scaling down the ready replicas from the outdated game server set.
  6. If maxUnavailable is then set to 80%, the equation changes: minAvailable is now 2 instead of 8, so the if statement is no longer true and the scale-down can proceed.
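The arithmetic in steps 1-3 and 6 can be replayed standalone. This is a sketch: the floor division below mimics what intstr.GetValueFromIntOrPercent does for a percentage value; it is not the Agones code itself.

```go
package main

import "fmt"

// minAvailable mirrors the controller's calculation for a percentage
// maxUnavailable: the percentage is taken against the fleet's replica count
// and rounded down, then subtracted from the replica count.
func minAvailable(replicas, maxUnavailablePct int32) int32 {
	unavailable := replicas * maxUnavailablePct / 100 // floor, like GetValueFromIntOrPercent(..., false)
	return replicas - unavailable
}

func main() {
	const replicas = 10
	readyReplicasCount := int32(5) // 0 + 1 + 4 ready across the three sets

	for _, pct := range []int32{25, 80} {
		m := minAvailable(replicas, pct)
		fmt.Printf("maxUnavailable=%d%%: minAvailable=%d, scale-down blocked=%v\n",
			pct, m, readyReplicasCount <= m)
	}
	// maxUnavailable=25%: minAvailable=8, scale-down blocked=true
	// maxUnavailable=80%: minAvailable=2, scale-down blocked=false
}
```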

Here is the relevant code snippet, with code irrelevant to this issue commented out.

package fleets
// from pkg/fleets/controller.go

func (c *Controller) rollingUpdateRestFixedOnReady(ctx context.Context, fleet *agonesv1.Fleet, active *agonesv1.GameServerSet, rest []*agonesv1.GameServerSet) error {
   // if len(rest) == 0 {
   //    return nil
   // }

   r, err := intstr.GetValueFromIntOrPercent(fleet.Spec.Strategy.RollingUpdate.MaxUnavailable, int(fleet.Spec.Replicas), false)
   if err != nil {
      return errors.Wrapf(err, "error parsing MaxUnavailable value: %s", fleet.ObjectMeta.Name)
   }
   // if r == 0 {
   //    r = 1
   // }
   // if r > int(fleet.Spec.Replicas) {
   //    r = int(fleet.Spec.Replicas)
   // }
   unavailable := int32(r)

   // totalAlreadyScaledDown := int32(0)
   // totalScaleDownCount := int32(0)
   allGSS := rest
   allGSS = append(allGSS, active)
   readyReplicasCount := agonesv1.GetReadyReplicaCountForGameServerSets(allGSS)
   minAvailable := fleet.Spec.Replicas - unavailable

   // maxScaledDown is kept uncommented, since it is passed to cleanupUnhealthyReplicas below
   allPodsCount := agonesv1.SumSpecReplicas(allGSS)
   newGSSUnavailablePodCount := active.Spec.Replicas - active.Status.ReadyReplicas - active.Status.AllocatedReplicas
   maxScaledDown := allPodsCount - minAvailable - newGSSUnavailablePodCount

   // if maxScaledDown <= 0 {
   //    return nil
   // }
   rest, _, err = c.cleanupUnhealthyReplicas(ctx, rest, fleet, maxScaledDown)
   if err != nil {
      return err
   }
   // Resulting value is readyReplicasCount + unavailable - fleet.Spec.Replicas
   // totalScaleDownCount = readyReplicasCount - minAvailable
   if readyReplicasCount <= minAvailable {
      // Cannot scale down.
      return nil
   }
   // The rest of this method is irrelevant since in the case of this issue we always return in the above statement.
   return nil
}

func (c *Controller) cleanupUnhealthyReplicas(ctx context.Context, rest []*agonesv1.GameServerSet,
        fleet *agonesv1.Fleet, maxCleanupCount int32) ([]*agonesv1.GameServerSet, int32, error) {

   totalScaledDown := int32(0)
   for i, gsSet := range rest {
      if totalScaledDown >= maxCleanupCount {
         break
      }
      if gsSet.Spec.Replicas == 0 {
         continue
      }
      if gsSet.Spec.Replicas == gsSet.Status.ReadyReplicas {
         continue
      }

      // scaledDownCount := int32(integer.IntMin(int(maxCleanupCount-totalScaledDown), int(gsSet.Spec.Replicas-gsSet.Status.ReadyReplicas)))
      // newReplicasCount := gsSet.Spec.Replicas - scaledDownCount
      // if newReplicasCount > gsSet.Spec.Replicas {
      //    return nil, 0, fmt.Errorf("when cleaning up unhealthy replicas, got invalid request to scale down %s/%s %d -> %d", gsSet.Namespace, gsSet.Name, gsSet.Spec.Replicas, newReplicasCount)
      // }
      //
      // gsSetCopy := gsSet.DeepCopy()
      // gsSetCopy.Spec.Replicas = newReplicasCount
      // totalScaledDown += scaledDownCount
      // if _, err := c.gameServerSetGetter.GameServerSets(gsSetCopy.ObjectMeta.Namespace).Update(ctx, gsSetCopy, metav1.UpdateOptions{}); err != nil {
      //    return nil, totalScaledDown, errors.Wrapf(err, "error updating gameserverset %s", gsSetCopy.ObjectMeta.Name)
      // }
      //
      // rest[i] = gsSetCopy
   }
   return rest, totalScaledDown, nil
}
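Plugging the example's numbers into the bookkeeping above (including the commented-out maxScaledDown computation) shows the same dead end from a second angle. This is a standalone sketch with plain ints standing in for the Agones types, not the real API.

```go
package main

import "fmt"

// gating replays rollingUpdateRestFixedOnReady's bookkeeping: it sums the
// spec and ready replicas across all game server sets and derives the
// scale-down budget and availability floor used by the early-return checks.
func gating(fleetReplicas, unavailable int32, spec, ready []int32,
	activeUnavailable int32) (maxScaledDown, readyCount, minAvail int32) {
	var allPods int32
	for i := range spec {
		allPods += spec[i]
		readyCount += ready[i]
	}
	minAvail = fleetReplicas - unavailable
	maxScaledDown = allPods - minAvail - activeUnavailable
	return
}

func main() {
	// Example fleet: 10 replicas, maxUnavailable 25% -> unavailable = 2.
	// Sets simple-game-server-1..3: Spec {0, 4, 4}, Ready {0, 1, 4}; the
	// active set (Spec 4, Ready 4, Allocated 0) has 0 unavailable pods.
	maxScaledDown, ready, minAvail := gating(10, 2, []int32{0, 4, 4}, []int32{0, 1, 4}, 0)

	fmt.Println("cleanup budget (maxScaledDown):", maxScaledDown) // 0: cleanupUnhealthyReplicas reaps nothing
	fmt.Println("ready", ready, "<= minAvailable", minAvail, "->", ready <= minAvail)
	// true: the method returns early and the outdated ready replica hangs
}
```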

When this happens, it is even possible that a new allocation selects the ready replica from the outdated game server set, causing the following state:

Before Allocation

NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-game-server-1   Packed       0         3         3           0       3m
simple-game-server-2   Packed       4         4         3           1       2m
simple-game-server-3   Packed       4         4         0           4       1m

After Allocation

NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-game-server-1   Packed       0         3         3           0       3m
simple-game-server-2   Packed       4         4         4           0       2m
simple-game-server-3   Packed       3         3         0           3       1m

The latest game server set lost one of its desired replicas in the process! In cases where the old and new game server sets compete for the last available replica, this leads to a situation where the new game server set disappears in favor of the old one, as shown here:

NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-game-server-1   Packed       3         6         6           0       12m
simple-game-server-2   Packed       4         4         4           0       11m
simple-game-server-3   Packed       0         0         0           0       10m

This issue can also be reproduced consistently with a Fleet Autoscaler, except that the case described in the previous paragraph does not occur as long as maxReplicas has not been reached.
