RollingUpdate maxUnavailable should ensure to allow new allocations #3438
Description
What happened:
As currently implemented, the `maxUnavailable` parameter of the RollingUpdate strategy is counterintuitive and can lead to situations where:
- new allocations are made on outdated game server sets
- ready replicas in outdated game server sets are not reaped
What you expected to happen:
- `maxUnavailable` should ensure that new allocations can always be served. The number of already allocated game servers should not influence this calculation, as they are not relevant to whether new allocations can be made.
- New allocations should be prioritized on active game server sets over outdated ones.
- Outdated game server sets should allow their ready replicas to be reaped, to let the active game server set reach its desired replica count and finish the rolling update.
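As a rough illustration of the expected behavior, the availability calculation could exclude allocated game servers before applying `maxUnavailable`, so that only replicas able to serve new allocations count. This is a sketch of one possible interpretation, not a proposed patch; `minAvailableIgnoringAllocated` is a hypothetical helper, not part of Agones:

```go
package main

import "fmt"

// minAvailableIgnoringAllocated sketches the expected calculation: only
// replicas that could serve *new* allocations count, so already allocated
// game servers are excluded before applying maxUnavailable.
// Hypothetical helper for illustration only.
func minAvailableIgnoringAllocated(replicas, allocated, maxUnavailablePercent int32) int32 {
	allocatable := replicas - allocated // replicas that can still serve new allocations
	if allocatable < 0 {
		allocatable = 0
	}
	unavailable := allocatable * maxUnavailablePercent / 100 // rounded down
	return allocatable - unavailable
}

func main() {
	// 10 desired replicas, 3 already allocated, maxUnavailable 25%:
	// only the 7 allocatable replicas factor in, so minAvailable is 7 - 1 = 6.
	fmt.Println(minAvailableIgnoringAllocated(10, 3, 25))
}
```

With this shape of calculation, the 3 allocated game servers in the reproduction below would no longer inflate `minAvailable` and block the scale-down of outdated ready replicas.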
How to reproduce it (as minimally and precisely as possible):

Install Agones

```shell
kubectl create namespace agones-system
kubectl apply --server-side -f https://raw.githubusercontent.com/googleforgames/agones/release-1.35.0/install/yaml/install.yaml
```

Create a Fleet

- Apply the example Fleet:

  ```shell
  kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/release-1.35.0/examples/simple-game-server/fleet.yaml
  ```

- Verify that the Fleet has 25% `maxSurge` and `maxUnavailable` by running:

  ```shell
  kubectl get fleets.agones.dev/simple-game-server -o json | jq ".spec.strategy"
  ```

  ```json
  {
    "rollingUpdate": {
      "maxSurge": "25%",
      "maxUnavailable": "25%"
    },
    "type": "RollingUpdate"
  }
  ```

- Set the number of replicas to 10 with `kubectl edit fleets.agones.dev/simple-game-server`.
- Cause 3 allocations by running the following command three times:

  ```shell
  kubectl create -f https://raw.githubusercontent.com/googleforgames/agones/release-1.35.0/examples/simple-game-server/gameserverallocation.yaml -o yaml
  ```
Trigger a RollingUpdate

- Run `watch kubectl get gameserversets` to make sure that the game servers are currently ready and not allocated:

  ```
  NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
  simple-game-server-1   Packed       10        10        3           7       1m
  ```

- Run `kubectl edit fleets.agones.dev/simple-game-server` again, to change the game server version to 0.17 instead of 0.18.
- Trigger 3 more allocations with `kubectl create -f https://raw.githubusercontent.com/googleforgames/agones/release-1.35.0/examples/simple-game-server/gameserverallocation.yaml -o yaml`. The game server sets should look something like this:

  ```
  NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
  simple-game-server-1   Packed       0         3         3           0       2m
  simple-game-server-2   Packed       7         7         3           4       1m
  ```

- Run `kubectl edit fleets.agones.dev/simple-game-server` one last time, to change the game server version to 0.16 instead of 0.17 and trigger one more rolling update.
- The game server sets now demonstrate the issue. `simple-game-server-2` is left with a hanging ready replica that never gets reaped, even though reaping it would bring the fleet back to its intended replica count and finish the update:

  ```
  NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
  simple-game-server-1   Packed       0         3         3           0       3m
  simple-game-server-2   Packed       4         4         3           1       2m
  simple-game-server-3   Packed       4         4         0           4       1m
  ```
Fix the broken RollingUpdate by updating the maxUnavailable value

- Edit the update strategy to allow the game server set to scale down: `kubectl edit fleets.agones.dev/simple-game-server`
- Change the value of `maxUnavailable` to 80%.
- Observe that the rolling update now completes successfully: the old game server sets see their `DESIRED` replica count go to zero, and all of their `READY` replicas get reaped.

  ```
  NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
  simple-game-server-1   Packed       0         3         3           0       4m
  simple-game-server-2   Packed       0         3         3           0       3m
  simple-game-server-3   Packed       4         4         0           4       2m
  ```
Anything else we need to know?:
In the code snippet below, `fleet` refers to the Fleet specification we have been editing, `active` refers to the GameServerSet currently being updated, and `rest` refers to the other GameServerSets that are not being updated.

Let's consider the case where there are the following three game server sets:

```
NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-game-server-1   Packed       0         3         3           0       3m
simple-game-server-2   Packed       4         4         3           1       2m
simple-game-server-3   Packed       4         4         0           4       1m
```

When `rollingUpdateRestFixedOnReady` is called, `simple-game-server-3` is the active game server set, and `simple-game-server-1` and `simple-game-server-2` are part of the rest. The following happens:

- `r, err := intstr.GetValueFromIntOrPercent(fleet.Spec.Strategy.RollingUpdate.MaxUnavailable, int(fleet.Spec.Replicas), false)` returns `.25 * 10` rounded down, which is `2`. This is then stored in the `unavailable` variable as an int32.
- `readyReplicasCount` is `5`, since there are a total of five ready replicas across the game server sets (0 + 1 + 4).
- `minAvailable` is `8` (`fleet.Spec.Replicas - unavailable`, so `10 - 2`).
- `cleanupUnhealthyReplicas` is then called, which does nothing, since the ready replicas on outdated game server sets aren't considered unhealthy.
- Then, the statement `if readyReplicasCount <= minAvailable { return nil }` causes the method to return without scaling down the ready replicas from the outdated game server sets.
- If `maxUnavailable` is then set to 80%, this changes the equation: `minAvailable` is now `2` instead of `8`, which makes this `if` statement no longer true.
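The arithmetic above can be sketched in plain Go. This is a simplified stand-in for `intstr.GetValueFromIntOrPercent` and the guard in the controller, written for illustration only; `minAvailableFor` and `canScaleDownRest` are hypothetical helper names, not Agones functions:

```go
package main

import "fmt"

// minAvailableFor mirrors the controller's arithmetic: maxUnavailable (given
// here as a percentage) is resolved against the fleet's replica count with
// rounding down, like GetValueFromIntOrPercent(..., roundUp=false).
func minAvailableFor(replicas, maxUnavailablePercent int32) int32 {
	unavailable := replicas * maxUnavailablePercent / 100 // rounds down
	return replicas - unavailable
}

// canScaleDownRest reproduces the guard that short-circuits the method:
// outdated ready replicas are only reaped when readyReplicasCount
// strictly exceeds minAvailable.
func canScaleDownRest(readyReplicasCount, replicas, maxUnavailablePercent int32) bool {
	return readyReplicasCount > minAvailableFor(replicas, maxUnavailablePercent)
}

func main() {
	// The state from the table above: 10 desired replicas, 5 ready in total (0 + 1 + 4).
	fmt.Println(minAvailableFor(10, 25))     // 8
	fmt.Println(canScaleDownRest(5, 10, 25)) // false: the hanging ready replica is never reaped
	fmt.Println(canScaleDownRest(5, 10, 80)) // true: minAvailable drops to 2, so scale-down proceeds
}
```

Note how the 3 allocated replicas effectively count toward availability here even though they cannot serve new allocations, which is the crux of this issue.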
Here is the relevant code snippet, with code irrelevant to this issue commented out.
```go
package fleets

// from pkg/fleets/controller.go
func (c *Controller) rollingUpdateRestFixedOnReady(ctx context.Context, fleet *agonesv1.Fleet, active *agonesv1.GameServerSet, rest []*agonesv1.GameServerSet) error {
	// if len(rest) == 0 {
	// 	return nil
	// }

	r, err := intstr.GetValueFromIntOrPercent(fleet.Spec.Strategy.RollingUpdate.MaxUnavailable, int(fleet.Spec.Replicas), false)
	if err != nil {
		return errors.Wrapf(err, "error parsing MaxUnavailable value: %s", fleet.ObjectMeta.Name)
	}
	// if r == 0 {
	// 	r = 1
	// }
	// if r > int(fleet.Spec.Replicas) {
	// 	r = int(fleet.Spec.Replicas)
	// }
	unavailable := int32(r)

	// totalAlreadyScaledDown := int32(0)
	// totalScaleDownCount := int32(0)
	allGSS := rest
	allGSS = append(allGSS, active)
	readyReplicasCount := agonesv1.GetReadyReplicaCountForGameServerSets(allGSS)
	minAvailable := fleet.Spec.Replicas - unavailable

	allPodsCount := agonesv1.SumSpecReplicas(allGSS)
	newGSSUnavailablePodCount := active.Spec.Replicas - active.Status.ReadyReplicas - active.Status.AllocatedReplicas
	maxScaledDown := allPodsCount - minAvailable - newGSSUnavailablePodCount
	if maxScaledDown <= 0 {
		return nil
	}

	rest, _, err = c.cleanupUnhealthyReplicas(ctx, rest, fleet, maxScaledDown)
	if err != nil {
		return nil
	}

	// Resulting value is readyReplicasCount + unavailable - fleet.Spec.Replicas
	// totalScaleDownCount = readyReplicasCount - minAvailable
	if readyReplicasCount <= minAvailable {
		// Cannot scale down.
		return nil
	}

	// The rest of this method is irrelevant, since in the case of this issue we always return in the statement above.
	return nil
}

func (c *Controller) cleanupUnhealthyReplicas(ctx context.Context, rest []*agonesv1.GameServerSet,
	fleet *agonesv1.Fleet, maxCleanupCount int32) ([]*agonesv1.GameServerSet, int32, error) {
	totalScaledDown := int32(0)
	for i, gsSet := range rest {
		if totalScaledDown >= maxCleanupCount {
			break
		}
		if gsSet.Spec.Replicas == 0 {
			continue
		}
		if gsSet.Spec.Replicas == gsSet.Status.ReadyReplicas {
			continue
		}
		// scaledDownCount := int32(integer.IntMin(int(maxCleanupCount-totalScaledDown), int(gsSet.Spec.Replicas-gsSet.Status.ReadyReplicas)))
		// newReplicasCount := gsSet.Spec.Replicas - scaledDownCount
		// if newReplicasCount > gsSet.Spec.Replicas {
		// 	return nil, 0, fmt.Errorf("when cleaning up unhealthy replicas, got invalid request to scale down %s/%s %d -> %d", gsSet.Namespace, gsSet.Name, gsSet.Spec.Replicas, newReplicasCount)
		// }
		//
		// gsSetCopy := gsSet.DeepCopy()
		// gsSetCopy.Spec.Replicas = newReplicasCount
		// totalScaledDown += scaledDownCount
		// if _, err := c.gameServerSetGetter.GameServerSets(gsSetCopy.ObjectMeta.Namespace).Update(ctx, gsSetCopy, metav1.UpdateOptions{}); err != nil {
		// 	return nil, totalScaledDown, errors.Wrapf(err, "error updating gameserverset %s", gsSetCopy.ObjectMeta.Name)
		// }
		//
		// rest[i] = gsSetCopy
	}
	return rest, totalScaledDown, nil
}
```

When this happens, it is even possible that a new allocation selects the ready replica from the outdated game server set, causing the following state to occur:
Before Allocation

```
NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-game-server-1   Packed       0         3         3           0       3m
simple-game-server-2   Packed       4         4         3           1       2m
simple-game-server-3   Packed       4         4         0           4       1m
```

After Allocation

```
NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-game-server-1   Packed       0         3         3           0       3m
simple-game-server-2   Packed       4         4         4           0       2m
simple-game-server-3   Packed       3         3         0           3       1m
```

The latest game server set lost one of its desired replicas in the process! In cases where the old and new game server sets compete for the last available replica, this leads to a situation where the new game server set disappears in favor of the old one, like here:
```
NAME                   SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-game-server-1   Packed       3         6         6           0       12m
simple-game-server-2   Packed       4         4         4           0       11m
simple-game-server-3   Packed       0         0         0           0       10m
```

This issue can also be reproduced consistently with a Fleet Autoscaler, except that the case mentioned in the previous paragraph wouldn't happen, as long as `maxReplicas` isn't reached.