Description
Checks
- I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
- I'm not using a custom entrypoint in my runner image
Controller Version
0.27.1
Helm Chart Version
0.23.0
CertManager Version
1.10.2
Deployment Method
Helm
cert-manager installation
Yes, installed cert manager from official sources.
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you need priority support.)
- I've read the release notes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
- My actions-runner-controller version (v0.x.y) does support the feature
- I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
- I've migrated to the workflow job webhook event (if you're using webhook-driven scaling)
Resource Definitions
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  annotations:
    meta.helm.sh/release-name: infra-runner-arm64
    meta.helm.sh/release-namespace: actions-runner-system
  labels:
    app.kubernetes.io/managed-by: Helm
  name: infra-runner-arm64
  namespace: actions-runner-system
spec:
  maxReplicas: 64
  minReplicas: 0
  scaleDownDelaySecondsAfterScaleOut: 300
  scaleTargetRef:
    name: infra-runner-arm64
  scaleUpTriggers:
  - amount: 1
    duration: 4m
    githubEvent:
      workflowJob: {}
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: infra-runner-arm64
  namespace: actions-runner-system
spec:
  template:
    spec:
      dockerdWithinRunnerContainer: true
      env:
      - name: RUNNER_GRACEFUL_STOP_TIMEOUT
        value: '90'
      group: armEnabled
      image: summerwind/actions-runner-dind
      imagePullPolicy: IfNotPresent
      labels:
      - self-hosted
      - Linux
      - linux
      - Ubuntu
      - ubuntu
      - arm64
      - ARM64
      - aarch64
      - core-auto
      nodeSelector:
        kubernetes.io/arch: arm64
        kubernetes.io/os: linux
      organization: <redacted>
      resources:
        limits:
          cpu: 2000m
          memory: 2048Mi
        requests:
          cpu: 1000m
          memory: 1024Mi
      serviceAccountName: actions-runner
      terminationGracePeriodSeconds: 100
      tolerations:
      - effect: NoSchedule
        key: kubernetes.io/arch
        operator: Equal
        value: arm64
      volumeMounts:
      - mountPath: /home/runner/work/shared
        name: shared-volume
      volumes:
      - name: shared-volume
        persistentVolumeClaim:
          claimName: infra-runner-arm64
To Reproduce
Set the webhook trigger duration to a reasonable value like "5m" to cover the time the HRA needs to scale up a runner. Set the HRA's minReplicas to 0. Then start five times the HRA's maxReplicas worth of jobs, with each job taking longer than the webhook trigger duration to complete. In other words, create a huge backlog of jobs.
Watch the autoscaler scale the runner pool to zero runners while there remains a huge backlog of jobs in the queue.
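The failure mode described above can be modeled with a short sketch. This is an illustration of the reported behavior under assumed values from the reproduction steps, not ARC source code: each webhook event adds a capacity reservation with a fixed TTL (the trigger duration), but the effective replica count is capped at maxReplicas, so reservations beyond the cap expire before any runner can pick up their jobs.

```python
# Model of the reported race (illustrative only, not ARC internals).
MAX_REPLICAS = 64
DURATION = 5 * 60           # webhook trigger duration, seconds
JOB_RUNTIME = 10 * 60       # each job outlives the duration
BACKLOG = 5 * MAX_REPLICAS  # 5x maxReplicas jobs queued at t=0

def live_reservations(t):
    """Reservations created at t=0 that have not yet expired at time t."""
    return BACKLOG if t < DURATION else 0

def desired_replicas(t):
    # Replicas are capped at maxReplicas regardless of backlog size.
    return min(MAX_REPLICAS, live_reservations(t))

# While jobs are still running (t < JOB_RUNTIME), desired replicas
# have already collapsed to minReplicas once the duration elapses:
for t in (0, DURATION - 1, DURATION, JOB_RUNTIME - 1):
    print(t, desired_replicas(t))
```

At t = 0 the pool is pinned at 64 runners, but once the 5-minute duration passes, every remaining reservation expires at once and the desired count drops to zero, even though most of the 320 queued jobs never started.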
Describe the bug
The capacity reservations expire before their jobs can even start, because the HRA cannot scale up past its maxReplicas. The webhook-based autoscaler expires most of the queued jobs' reservations before they have had a chance to be started, and scales the runner pool down even though there are still many jobs waiting.
Describe the expected behavior
The timer on the HRA duration should start either when the job is assigned to a runner or when the HRA tries to scale up. If the HRA is already at maxReplicas, the reservation should live indefinitely, until it has a chance to be assigned to an idle runner, at which point the duration timer can start.
Alternatively, there could be a separate timeout for how long a job can live in the backlog before the autoscaler forgets about it. It doesn't make sense to me that the duration period has to cover the time a maxed-out cluster might need to work through a huge backlog, which may be several hours. As I understand it, this is also how long an idle runner could be left running before being scaled down if, for some reason, the job canceled/completed event is missed.
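The proposed semantics can be sketched as follows. This is a hypothetical illustration of the suggestion above (names like admit and expires_at are illustrative, not ARC's API): a reservation's expiry timer starts only once it fits under maxReplicas, so reservations queued behind the cap wait without a TTL instead of expiring in the backlog.

```python
from collections import deque

def admit(reservations, max_replicas, now, duration):
    """Start the TTL clock only for reservations that fit under the cap."""
    active, queued = [], deque()
    for r in reservations:
        if len(active) < max_replicas:
            r["expires_at"] = now + duration  # clock starts at admission
            active.append(r)
        else:
            r["expires_at"] = None            # still in backlog: no expiry yet
            queued.append(r)
    return active, queued

reservations = [{"job": i} for i in range(5)]
active, queued = admit(reservations, max_replicas=2, now=0, duration=300)
print(len(active), len(queued))
```

As runners free up, queued reservations would be admitted and only then begin their duration countdown, so a deep backlog never silently evaporates while the pool is saturated.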
Whole Controller Logs
Logs have already rolled over.
Whole Runner Pod Logs
Logs have already rolled over.
Additional Context
You can see in the webhook server logs entries like this:
2023-04-02T21:01:17Z DEBUG controllers.webhookbasedautoscaler Patching hra infra-runner-arm64 for capacityReservations update {"before": 220, "expired": 15, "added": 0, "completed": -2, "after": 205}
I don't believe this is a coding error. I believe this is a flaw in the design of the capacity reservation system. Capacity that has been reserved but cannot be filled because the runner pool is already at max capacity should not expire while waiting for real capacity to become available.