Using containerMode kubernetes causes random step failures #2805

@mhuijgen

Description

Controller Version

0.4.0

Helm Chart Version

0.4.0

CertManager Version

N/A

Deployment Method

Helm

cert-manager installation

cert-manager not required

Checks

  • This isn't a question or user support case (for Q&A and community support, go to Discussions; if priority support is business-critical, consider contracting with one of the contributors or maintainers)
  • I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • I've migrated to the workflow job webhook event (if you're using webhook-driven scaling)

Resource Definitions

containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "default"
    resources:
      requests:
        storage: 16Gi

template:
  spec:
    restartPolicy: Never
    nodeSelector:
      kubernetes.io/os: linux
    initContainers:
    - name: init-k8s-volume-permissions
      image: ghcr.io/actions/actions-runner:latest
      command: ["sudo", "chown", "-R", "runner", "/home/runner/_work"]
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      resources:
        requests:
          cpu: "1.3"

To Reproduce

So far this only occurs when running a large workflow. The failing workflow runs up to 20 parallel jobs, each comprising about 9 steps. A few dozen jobs run successfully, but some fail at a random step with the following error:

Run '/home/runner/k8s/index.js'
node:internal/process/promises:279
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "#<ErrorEvent>".] {
  code: 'ERR_UNHANDLED_REJECTION'
}
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.

Describe the bug

Some jobs abort at seemingly random steps. All fail with the error listed under "To Reproduce".

Describe the expected behavior

The workflow has executed successfully dozens of times when using containerMode dind. I expect it to also execute reliably when using containerMode kubernetes.

Whole Controller Logs

I will update this after the next failing run, together with the runner pod log of a failing job.

Whole Runner Pod Logs

I will try to extract the logs of the runner pod running a failing job, but since the pods disappear immediately after failure, I will need to stream all runner pod logs to a file and then filter for the failing pod.
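As a sketch of that capture step (the namespace arc-runners and the label selector are assumptions, not taken from this report; check your install with kubectl get pods --show-labels and adjust both), something like this could stream all runner pod logs to a file for later filtering:

```shell
# Follow logs of all runner pods into one file, prefixing each line with the
# pod name so a failing pod can be identified afterwards.
kubectl logs -n arc-runners -l app.kubernetes.io/component=runner \
  --follow --prefix --max-log-requests=50 > runner-logs.txt

# Afterwards, filter for the pod that hit the container hook failure,
# keeping a few lines of context before the error.
grep -B5 'ERR_UNHANDLED_REJECTION' runner-logs.txt
```

Note that --follow only captures pods that exist when the command starts; for short-lived runner pods across a whole workflow run, a log-tailing tool that picks up new pods automatically may be more practical.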

Additional Context

Note that I had to add the init container that fixes the permissions on the kubernetesModeWorkVolumeClaim PV, because it is provisioned by Azure as an empty filesystem owned by root:root, while the runner runs as the runner user. Without it, the runner pod itself immediately fails with an error that it cannot write to the _work folder.
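As a possible alternative to the chown init container (a sketch only, not verified against ARC 0.4.0): setting fsGroup in the pod securityContext asks the kubelet to apply group ownership to the mounted volume at attach time, which may avoid the extra container. The GID below is an assumption about the ghcr.io/actions/actions-runner image and should be verified first.

```yaml
template:
  spec:
    securityContext:
      # Assumption: the runner user's primary GID in
      # ghcr.io/actions/actions-runner is 1001 -- verify with `id runner`
      # inside the image before relying on this value.
      fsGroup: 1001
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
```

Whether fsGroup is honored also depends on the CSI driver's fsGroupPolicy, so the init container remains the more broadly portable workaround.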

This issue might actually belong in https://github.com/actions/runner-container-hooks; if so desired, I'm happy to create a linked issue there.

Metadata

Labels

bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode)
