
Jobs expire while on queue #2466

@Nuru

Description

Controller Version

0.27.1

Helm Chart Version

0.23.0

CertManager Version

1.10.2

Deployment Method

Helm

cert-manager installation

Yes, installed cert manager from official sources.

Checks

  • This isn't a question or user support case. (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you need priority support.)
  • I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  annotations:
    meta.helm.sh/release-name: infra-runner-arm64
    meta.helm.sh/release-namespace: actions-runner-system

  labels:
    app.kubernetes.io/managed-by: Helm
  name: infra-runner-arm64
  namespace: actions-runner-system

spec:
  maxReplicas: 64
  minReplicas: 0
  scaleDownDelaySecondsAfterScaleOut: 300
  scaleTargetRef:
    name: infra-runner-arm64
  scaleUpTriggers:
    - amount: 1
      duration: 4m
      githubEvent:
        workflowJob: {}
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: infra-runner-arm64
  namespace: actions-runner-system

spec:
  template:
    spec:
      dockerdWithinRunnerContainer: true
      env:
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: '90'
      group: armEnabled
      image: summerwind/actions-runner-dind
      imagePullPolicy: IfNotPresent
      labels:
        - self-hosted
        - Linux
        - linux
        - Ubuntu
        - ubuntu
        - arm64
        - ARM64
        - aarch64
        - core-auto
      nodeSelector:
        kubernetes.io/arch: arm64
        kubernetes.io/os: linux
      organization: <redacted>
      resources:
        limits:
          cpu: 2000m
          memory: 2048Mi
        requests:
          cpu: 1000m
          memory: 1024Mi
      serviceAccountName: actions-runner
      terminationGracePeriodSeconds: 100
      tolerations:
        - effect: NoSchedule
          key: kubernetes.io/arch
          operator: Equal
          value: arm64
      volumeMounts:
        - mountPath: /home/runner/work/shared
          name: shared-volume
      volumes:
        - name: shared-volume
          persistentVolumeClaim:
            claimName: infra-runner-arm64

To Reproduce

Set the webhook trigger duration to a reasonable value like "5m", enough to cover the time the HRA needs to scale up a runner. Set the HRA's minReplicas to 0. Then start 5 times the HRA's maxReplicas worth of jobs, with each job taking longer than the webhook trigger duration to complete. In other words, create a huge backlog of jobs.

Watch the autoscaler scale the runner pool to zero runners while there remains a huge backlog of jobs in the queue.

Describe the bug

The capacity reservations expire before the corresponding jobs are even started, because the HRA cannot scale up past its maxReplicas. The webhook-based autoscaler expires most of the jobs on the queue before they have had a chance to start, and scales the runner pool down even though there are still a ton of jobs waiting.
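The arithmetic behind this can be sketched as a toy model. This is illustrative only, not ARC's actual implementation; the numbers (maxReplicas=64, a 4-minute trigger duration, 10-minute jobs, a 5x backlog) are assumptions matching the scenario above:

```python
def surviving_reservations(backlog, max_replicas, ttl_min, job_min):
    """Count reservations whose job starts before the fixed TTL expires.

    Toy model: the TTL clock starts when the webhook fires (job queued),
    so a reservation survives only if a runner frees up within the TTL.
    The first max_replicas jobs start immediately; every later job has
    to wait for at least one full job length, which exceeds the TTL.
    """
    survivors = 0
    for i in range(backlog):
        wait_min = (i // max_replicas) * job_min  # minutes until a slot opens
        if wait_min < ttl_min:
            survivors += 1
    return survivors

# 5x maxReplicas jobs queued at once, each taking 10 minutes, 4m TTL:
print(surviving_reservations(backlog=320, max_replicas=64,
                             ttl_min=4, job_min=10))
# Only the first 64 reservations survive; the other 256 expire on queue.
```

Under this model the pool scales back toward zero as soon as the first wave of jobs finishes, even though 256 jobs are still queued.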

Describe the expected behavior

The timer on the HRA duration should start either when the job is assigned to a runner or when the HRA tries to scale up. If the HRA is already at maxReplicas, the reservation should live indefinitely, until it has a chance to be assigned to an idle runner, at which point the duration timer can start.

Alternately, there could be a separate timeout for how long a job can live in the backlog before the autoscaler forgets about it. It doesn't make sense to me that the duration period has to include the time a maxed-out cluster might take to work through a huge backlog (which may be several hours), because, as I understand it, that duration is also how long an idle runner will be left running before being scaled down if, for some reason, the job's canceled/completed event is missed.
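The proposed rule can be stated as a small sketch. This is my assumption about how it could work, not existing ARC code: the duration timer only starts once the HRA can actually act on the reservation (scale up, or hand the job to an idle runner), so a reservation cannot expire merely for waiting behind a pool pinned at maxReplicas:

```python
def is_expired(now, ttl_started_at, ttl):
    """Proposed expiry rule (sketch, not ARC's implementation).

    ttl_started_at is None while the reservation is still waiting for
    capacity: the duration timer has not begun, so the reservation is
    kept indefinitely. Once capacity becomes available, the timer is
    stamped and the normal TTL applies. Times are in minutes.
    """
    return ttl_started_at is not None and now - ttl_started_at >= ttl

# A reservation queued at t=0, still waiting at t=60, is kept:
assert not is_expired(now=60, ttl_started_at=None, ttl=4)
# Once a runner slot opened at t=56, the 4-minute TTL runs normally:
assert is_expired(now=60, ttl_started_at=56, ttl=4)
```

A separate, much longer backlog timeout could then cap `ttl_started_at is None` reservations without coupling that cap to the scale-up duration.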

Whole Controller Logs

Logs have already rolled over.

Whole Runner Pod Logs

Logs have already rolled over.

Additional Context

You can see in the webhook server logs entries like this:

2023-04-02T21:01:17Z	DEBUG	controllers.webhookbasedautoscaler	Patching hra infra-runner-arm64 for capacityReservations update	{"before": 220, "expired": 15, "added": 0, "completed": -2, "after": 205}

I don't believe this is a coding error. I believe this is a flaw in the design of the capacity reservation system. Capacity that has been reserved but cannot be filled because the runner pool is already at max capacity should not expire while waiting for real capacity to become available.

Labels

bug (Something isn't working), needs triage (Requires review from the maintainers)
