Using containerMode kubernetes causes random step failures #2805

@mhuijgen

Description

Controller Version

0.4.0

Helm Chart Version

0.4.0

CertManager Version

N/A

Deployment Method

Helm

cert-manager installation

cert-manager not required

Checks

  • This isn't a question or user support case (for Q&A and community support, go to Discussions; if priority support is business-critical, consider contracting with one of the contributors or maintainers)
  • I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • I've migrated to the workflow job webhook event (if you're using webhook-driven scaling)

Resource Definitions

containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "default"
    resources:
      requests:
        storage: 16Gi

template:
  spec:
    restartPolicy: Never
    nodeSelector:
      kubernetes.io/os: linux
    initContainers:
    - name: init-k8s-volume-permissions
      image: ghcr.io/actions/actions-runner:latest
      command: ["sudo", "chown", "-R", "runner", "/home/runner/_work"]
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      resources:
        requests:
          cpu: "1.3"

To Reproduce

So far this only occurs when running a large workflow. The failing workflow runs up to 20 parallel jobs, each comprising about 9 steps. A few dozen jobs run successfully, but some fail at a random step with the following error:

Run '/home/runner/k8s/index.js'
node:internal/process/promises:279
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "#<ErrorEvent>".] {
  code: 'ERR_UNHANDLED_REJECTION'
}
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.

Describe the bug

Some jobs abort at seemingly random steps. All fail with the error listed under "To Reproduce".

Describe the expected behavior

The workflow has executed successfully dozens of times when using containerMode dind. I expect it to also execute reliably when using containerMode kubernetes.

Whole Controller Logs

I will update this after the next failing run, together with the runner pod log of a failing job.

Whole Runner Pod Logs

I will try to extract the logs of the runner pod running a failing job, but since the pods disappear immediately after failure, I will need to stream all runner pod logs to a file and then filter for the failing pod.
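As a sketch of that capture step (the namespace arc-runners and the label selector are assumptions, not taken from this report; check your install with kubectl get pods --show-labels and adjust both), something like this could stream all runner pod logs to a file for later filtering:

```shell
# Follow logs of all runner pods into one file, prefixing each line with the
# pod name so a failing pod can be identified afterwards.
kubectl logs -n arc-runners -l app.kubernetes.io/component=runner \
  --follow --prefix --max-log-requests=50 > runner-logs.txt

# Afterwards, filter for the pod that hit the container hook failure,
# keeping a few lines of context before the error.
grep -B5 'ERR_UNHANDLED_REJECTION' runner-logs.txt
```

Note that --follow only captures pods that exist when the command starts; for short-lived runner pods across a whole workflow run, a log-tailing tool that picks up new pods automatically may be more practical.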

Additional Context

Note that I had to add the init container that fixes the permissions on the kubernetesModeWorkVolumeClaim PV, because it is provisioned by Azure as an empty filesystem owned by root:root, while the runner runs as the runner user. Without it, the runner pod itself immediately fails with an error that it cannot write to the _work folder.
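As a possible alternative to the chown init container (a sketch only, not verified against ARC 0.4.0): setting fsGroup in the pod securityContext asks the kubelet to apply group ownership to the mounted volume at attach time, which may avoid the extra container. The GID below is an assumption about the ghcr.io/actions/actions-runner image and should be verified first.

```yaml
template:
  spec:
    securityContext:
      # Assumption: the runner user's primary GID in
      # ghcr.io/actions/actions-runner is 1001 -- verify with `id runner`
      # inside the image before relying on this value.
      fsGroup: 1001
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
```

Whether fsGroup is honored also depends on the CSI driver's fsGroupPolicy, so the init container remains the more broadly portable workaround.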

This issue might actually belong in https://github.com/actions/runner-container-hooks; if so desired, I'm happy to create a linked issue there.

Metadata

Labels

bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode)
