
Jobs expire while on queue #2466

@Nuru

Description

Controller Version

0.27.1

Helm Chart Version

0.23.0

CertManager Version

1.10.2

Deployment Method

Helm

cert-manager installation

Yes, installed cert manager from official sources.

Checks

  • This isn't a question or user support case. (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you need priority support.)
  • I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  annotations:
    meta.helm.sh/release-name: infra-runner-arm64
    meta.helm.sh/release-namespace: actions-runner-system

  labels:
    app.kubernetes.io/managed-by: Helm
  name: infra-runner-arm64
  namespace: actions-runner-system

spec:
  maxReplicas: 64
  minReplicas: 0
  scaleDownDelaySecondsAfterScaleOut: 300
  scaleTargetRef:
    name: infra-runner-arm64
  scaleUpTriggers:
    - amount: 1
      duration: 4m
      githubEvent:
        workflowJob: {}
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: infra-runner-arm64
  namespace: actions-runner-system

spec:
  template:
    spec:
      dockerdWithinRunnerContainer: true
      env:
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: '90'
      group: armEnabled
      image: summerwind/actions-runner-dind
      imagePullPolicy: IfNotPresent
      labels:
        - self-hosted
        - Linux
        - linux
        - Ubuntu
        - ubuntu
        - arm64
        - ARM64
        - aarch64
        - core-auto
      nodeSelector:
        kubernetes.io/arch: arm64
        kubernetes.io/os: linux
      organization: <redacted>
      resources:
        limits:
          cpu: 2000m
          memory: 2048Mi
        requests:
          cpu: 1000m
          memory: 1024Mi
      serviceAccountName: actions-runner
      terminationGracePeriodSeconds: 100
      tolerations:
        - effect: NoSchedule
          key: kubernetes.io/arch
          operator: Equal
          value: arm64
      volumeMounts:
        - mountPath: /home/runner/work/shared
          name: shared-volume
      volumes:
        - name: shared-volume
          persistentVolumeClaim:
            claimName: infra-runner-arm64

To Reproduce

Set the webhook trigger duration to a reasonable value like "5m", enough to cover the time the HRA needs to scale up a runner. Set the HRA's minReplicas to 0. Then start 5 times the HRA's maxReplicas worth of jobs, with each job taking longer than the webhook trigger duration to complete. In other words, create a huge backlog of jobs.

Watch the autoscaler scale the runner pool to zero runners while there remains a huge backlog of jobs in the queue.

Describe the bug

The capacity reservations expire before the corresponding jobs are even started, because the HRA cannot scale up past its maxReplicas. The webhook-based autoscaler expires most of the jobs on the queue before they have had a chance to start, and scales the runner pool down even though there are still a ton of jobs waiting.
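The arithmetic behind this can be sketched as a toy model. This is illustrative only, not ARC's actual implementation; the numbers (maxReplicas=64, a 4-minute trigger duration, 10-minute jobs, a 5x backlog) are assumptions matching the scenario above:

```python
def surviving_reservations(backlog, max_replicas, ttl_min, job_min):
    """Count reservations whose job starts before the fixed TTL expires.

    Toy model: the TTL clock starts when the webhook fires (job queued),
    so a reservation survives only if a runner frees up within the TTL.
    The first max_replicas jobs start immediately; every later job has
    to wait for at least one full job length, which exceeds the TTL.
    """
    survivors = 0
    for i in range(backlog):
        wait_min = (i // max_replicas) * job_min  # minutes until a slot opens
        if wait_min < ttl_min:
            survivors += 1
    return survivors

# 5x maxReplicas jobs queued at once, each taking 10 minutes, 4m TTL:
print(surviving_reservations(backlog=320, max_replicas=64,
                             ttl_min=4, job_min=10))
# Only the first 64 reservations survive; the other 256 expire on queue.
```

Under this model the pool scales back toward zero as soon as the first wave of jobs finishes, even though 256 jobs are still queued.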

Describe the expected behavior

The timer on the HRA duration should start either when the job is assigned to a runner or when the HRA tries to scale up. If the HRA is already at maxReplicas, the reservation should live indefinitely, until it has a chance to be assigned to an idle runner, at which point the duration timer can start.

Alternately, there could be a separate timeout for how long a job can live in the backlog before the autoscaler forgets about it. It doesn't make sense to me that the duration period has to include the time a maxed-out cluster might take to work through a huge backlog (which may be several hours), because, as I understand it, that duration is also how long an idle runner will be left running before being scaled down if, for some reason, the job's canceled/completed event is missed.
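The proposed rule can be stated as a small sketch. This is my assumption about how it could work, not existing ARC code: the duration timer only starts once the HRA can actually act on the reservation (scale up, or hand the job to an idle runner), so a reservation cannot expire merely for waiting behind a pool pinned at maxReplicas:

```python
def is_expired(now, ttl_started_at, ttl):
    """Proposed expiry rule (sketch, not ARC's implementation).

    ttl_started_at is None while the reservation is still waiting for
    capacity: the duration timer has not begun, so the reservation is
    kept indefinitely. Once capacity becomes available, the timer is
    stamped and the normal TTL applies. Times are in minutes.
    """
    return ttl_started_at is not None and now - ttl_started_at >= ttl

# A reservation queued at t=0, still waiting at t=60, is kept:
assert not is_expired(now=60, ttl_started_at=None, ttl=4)
# Once a runner slot opened at t=56, the 4-minute TTL runs normally:
assert is_expired(now=60, ttl_started_at=56, ttl=4)
```

A separate, much longer backlog timeout could then cap `ttl_started_at is None` reservations without coupling that cap to the scale-up duration.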

Whole Controller Logs

Logs have already rolled over.

Whole Runner Pod Logs

Logs have already rolled over.

Additional Context

You can see in the webhook server logs entries like this:

2023-04-02T21:01:17Z	DEBUG	controllers.webhookbasedautoscaler	Patching hra infra-runner-arm64 for capacityReservations update	{"before": 220, "expired": 15, "added": 0, "completed": -2, "after": 205}

I don't believe this is a coding error. I believe this is a flaw in the design of the capacity reservation system. Capacity that has been reserved but cannot be filled because the runner pool is already at max capacity should not expire while waiting for real capacity to become available.

Labels

bug (Something isn't working), needs triage (Requires review from the maintainers)
