Checks
Controller Version
0.4.0
Helm Chart Version
0.4.0
CertManager Version
N/A
Deployment Method
Helm
cert-manager installation
cert-manager not required
Checks
Resource Definitions
```yaml
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "default"
    resources:
      requests:
        storage: 16Gi
template:
  spec:
    restartPolicy: Never
    nodeSelector:
      kubernetes.io/os: linux
    initContainers:
      - name: init-k8s-volume-permissions
        image: ghcr.io/actions/actions-runner:latest
        command: ["sudo", "chown", "-R", "runner", "/home/runner/_work"]
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "1.3"
```
To Reproduce
So far this only occurs when running a large workflow. The failing workflow runs up to 20 parallel jobs, each comprising about 9 steps. A few dozen jobs complete successfully, but some fail at a random step with the following error:
```
Run '/home/runner/k8s/index.js'
node:internal/process/promises:279
triggerUncaughtException(err, true /* fromPromise */);
^
[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "#<ErrorEvent>".] {
code: 'ERR_UNHANDLED_REJECTION'
}
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
```
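For context, the crash signature in the log is what Node produces for any promise rejection that is never handled. The sketch below is hypothetical and only mirrors the failure mode (it is not the hook's actual code): the same rejection, once given a `.catch()`, becomes a handled failure with a useful message instead of a fatal `ERR_UNHANDLED_REJECTION`.

```javascript
// Hypothetical stand-in for a step in the k8s hook that can reject,
// e.g. when the cluster connection emits an ErrorEvent.
function riskyStep() {
  return Promise.reject(new Error('simulated ErrorEvent from the cluster'));
}

// Returning the chained promise keeps the rejection handled; without the
// .catch(), Node (v15+) would crash exactly as in the log above.
function runStep() {
  return riskyStep().catch((err) => `handled: ${err.message}`);
}

runStep().then((result) => console.log(result));
```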
Describe the bug
Some jobs abort at seemingly random steps. All fail with the error listed under the reproduce steps above.
Describe the expected behavior
The workflow has executed successfully dozens of times when using container mode dind. I expect it to also execute reliably when using container mode kubernetes.
Whole Controller Logs
Will update this after the next failing run, together with the runner pod log of a failing job.
Whole Runner Pod Logs
Will try to extract the logs of the runner pod running a failing job, but since the pods disappear immediately after failure, I will need to stream all runner pod logs to a file and then filter for the failing pod.
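One way to capture these short-lived pod logs (a sketch; the namespace and label selector are assumptions and will differ per scale set):

```shell
# Stream logs from all runner pods into one file, prefixing each line with
# the pod name so a failing pod can be grepped out afterwards.
# Assumptions: namespace "arc-runners" and the scale-set label shown below.
kubectl logs -n arc-runners -f --prefix \
  -l actions.github.com/scale-set-name=my-runner-set \
  --max-log-requests=50 > runner-pods.log
```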
Additional Context
Note that I had to add the initContainer that fixes the permissions on the kubernetesModeWorkVolumeClaim PV, because it is provisioned by Azure as an empty filesystem owned by root:root while the runner runs as user runner. Without it, the runner pod itself fails immediately with an error that it cannot write to the _work folder.
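For reference, an alternative to the chown initContainer might be a pod-level `fsGroup` (an untested sketch; it assumes the Azure disk CSI driver applies fsGroup to the volume, and that the runner user's group ID in the actions-runner image is 1001):

```yaml
template:
  spec:
    securityContext:
      # Assumption: 1001 is the runner group in ghcr.io/actions/actions-runner;
      # verify with `id runner` inside the image before relying on this.
      fsGroup: 1001
```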
This issue might actually belong in https://github.com/actions/runner-container-hooks; if so desired, I'm happy to create a linked issue there.