Checks
Controller Version
0.13.0
Deployment Method
Helm
Checks
To Reproduce
1. Deploy `gha-runner-scale-set` with (an illustrative values sketch follows this list):
- Controller version: `0.13.0`
- Runner type: `kubernetes` with `type: novolume`
- `ACTIONS_RUNNER_CONTAINER_HOOKS=/home/runner/k8s-novolume/index.js`
- `k8s-novolume` hooks version: `0.13.0`
- Overlay filesystem, no PVC, only ephemeral storage.
2. Create a workflow that (an illustrative workflow sketch follows this list):
- Uses `actions/checkout@v4`
- Uses `actions/setup-go@v5` with `go-version-file` pointing to `go.mod`
- Runs `go test` with coverage
- Uploads artifacts with `actions/upload-artifact@v4`
- Publishes JUnit results with `dorny/test-reporter@v2`
3. Trigger the workflow on a Go repository (in our case, using a composite action that:
- Installs some system deps
- Runs `go test -v ./...` with coverage
- Generates `coverage.out`, HTML / text coverage reports, and `reports/junit.xml`
- Uploads them via `upload-artifact` and `dorny/test-reporter`).
4. Observe the job logs:
- For almost every step, you see:
`Run '/home/runner/k8s-novolume/index.js'`
followed by Node’s Buffer deprecation warning.
- Around `actions/upload-artifact@v4` and `dorny/test-reporter@v2`, the job appears to hang for tens of minutes.
5. While the job is “stuck”, exec into:
- The **workflow pod** and run `ps -ef`:
you see `cd /__w/_temp && find . ... -exec stat ...`.
- The **runner pod** and run `ps -ef`:
you see `cd /home/runner/_work/_temp && find . ... -exec stat ...`.
6. Check the size of `_temp`:
- In the workflow pod:
- `du -sh /__w/_temp` → ~`309M`
- `find /__w/_temp | wc -l` → ~`14026` files
- The disk itself is fast (e.g. `dd` shows hundreds of MB/s).
7. While a step is slow (e.g. around `upload-artifact` / `test-reporter`):
- Manually delete the Go setup temp directory under `_temp` in **both** pods, e.g.:
- Runner pod: `rm -rf /home/runner/_work/_temp/<go-setup-guid-folder>`
- Workflow pod: `rm -rf /__w/_temp/<go-setup-guid-folder>`
8. After removing those directories, observe:
- The currently-running step completes within minutes instead of tens of minutes.
- Remaining steps also complete quickly.
- Overall job runtime drops dramatically.
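
The deployment in step 1 can be approximated with a values file along the following lines. This is a minimal sketch based on the standard `gha-runner-scale-set` chart layout; the GitHub URL, secret name, and image tag are placeholders, and the exact chart wiring that enables the `novolume` variant is not shown beyond the `ACTIONS_RUNNER_CONTAINER_HOOKS` path quoted in step 1.

```yaml
# Hypothetical values.yaml excerpt for the gha-runner-scale-set chart.
# githubConfigUrl / githubConfigSecret / image tag are placeholders.
githubConfigUrl: "https://github.com/my-org/my-repo"
githubConfigSecret: my-github-app-secret
containerMode:
  type: "kubernetes"            # runner type from step 1
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          # Selects the novolume variant of the container hooks
          # (path and hooks version 0.13.0 as stated in step 1).
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s-novolume/index.js
```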
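
A workflow of the shape described in step 2 looks roughly like this. The scale set name, job container image, and report paths are placeholders, and the generation of `reports/junit.xml` (done by a composite action, see step 3) is omitted.

```yaml
name: go-tests
on: [push]

jobs:
  test:
    runs-on: my-novolume-runner-set      # placeholder scale set name
    container: ubuntu:24.04              # placeholder job container ("workflow pod")
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version-file: go.mod        # toolchain gets extracted under $RUNNER_TEMP
      - name: Run tests with coverage
        run: go test -v -coverprofile=coverage.out ./...
      - name: Upload coverage artifact
        uses: actions/upload-artifact@v4
        with:
          name: coverage
          path: coverage.out
      - name: Publish JUnit results
        uses: dorny/test-reporter@v2
        with:
          name: go-tests
          path: reports/junit.xml        # produced by the composite action (not shown)
          reporter: java-junit
```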
Describe the bug
When using `gha-runner-scale-set` (controller version `0.13.0`) with:
- Runner type: `kubernetes` with `type: novolume`
- `ACTIONS_RUNNER_CONTAINER_HOOKS=/home/runner/k8s-novolume/index.js`
- `k8s-novolume` hooks version: `0.13.0`

we see severe slowdowns whenever `_temp` becomes non-trivial in size.
For every step, the container hook runs:
- In the workflow pod:
  `cd /__w/_temp && find . -not -path '*/_runner_hook_responses*' -exec stat -c '%b %n' {} \;`
- In the runner pod:
  `cd /home/runner/_work/_temp && find . -not -path '*/_runner_hook_responses*' -exec stat -c '%b %n' {} \;`
With Go workflows that use `actions/setup-go@v5`, `_temp` grows to ~300MB and ~14k files because the Go toolchain is extracted to `/__w/_temp/<guid>/...` and then cached.
Once `_temp` reaches that size, each invocation of `index.js` spends a long time scanning the entire `_temp` tree via `find` + `stat`. As a result:
- Steps like `actions/upload-artifact@v4` and `dorny/test-reporter@v2` appear to hang for tens of minutes.
- The overall job runtime grows to 50+ minutes, even though the actual `go test` work finishes much earlier.

The underlying disk is not the bottleneck (tested with `dd`, showing ~700MB/s writes and ~300MB/s reads). The expensive part is the repeated full scan of `_temp` from the hook.
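
To confirm that the scan, rather than the disk, is the expensive part, a throwaway diagnostic step like the one below can be added to the job. It is a sketch only: it assumes `$RUNNER_TEMP` resolves to `/__w/_temp` inside the workflow pod (consistent with the paths seen in the process list above) and simply times the same `find` + `stat` shape of command that the hook runs.

```yaml
- name: Time a hook-style scan of _temp (diagnostic only)
  if: always()
  run: |
    du -sh "$RUNNER_TEMP"
    find "$RUNNER_TEMP" | wc -l
    # Same command shape as the hook's scan; output discarded so only the
    # traversal + stat cost is measured.
    time sh -c 'cd "$RUNNER_TEMP" && find . -not -path "*/_runner_hook_responses*" -exec stat -c "%b %n" {} \; > /dev/null'
```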
In addition to that, `novolume` mode seems to synchronize the contents of `_temp` between the runner pod and the workflow pod. Both pods show mirrored contents under:
- `/home/runner/_work/_temp` (runner pod)
- `/__w/_temp` (workflow pod)

This suggests that some kind of pod-to-pod copy over the Kubernetes API is happening (in effect it looks similar to `kubectl cp` / tar-stream behavior). When `_temp` contains hundreds of MB and thousands of files, this cross-pod sync likely adds even more overhead on top of the `find`/`stat` scan, further amplifying the slowdown.
If we manually delete the Go-related temp directory under `_temp` in both pods while a step is slow, the step completes quickly and the rest of the workflow also runs fast. This strongly suggests that:
- the `_temp` disk-usage scan in `k8s-novolume`, and
- the implied copy/synchronization of that directory between the two pods

are the main causes of the slowdown for this scenario.
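
As a stopgap only (not a fix for the hook behavior), the manual cleanup from step 7 can be scripted as an extra workflow step along these lines. This sketch only touches the `_temp` copy visible to the workflow pod, so it may not fully reproduce the effect of deleting the directory in both pods by hand; `<go-setup-guid-folder>` is the GUID-named directory created by `actions/setup-go` and has to be identified from the `du` output.

```yaml
- name: Prune the Go toolchain folder from _temp (workaround sketch)
  run: |
    # Show what is taking up space under the runner temp directory.
    du -sh "$RUNNER_TEMP"/* | sort -h
    # Placeholder from the report: replace with the GUID-named directory
    # that actions/setup-go extracted the toolchain into.
    GO_TMP_DIR="<go-setup-guid-folder>"
    rm -rf "${RUNNER_TEMP:?}/${GO_TMP_DIR}"
```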
Describe the expected behavior
I would expect the `k8s-novolume` hooks to:
- Not introduce large overhead relative to the actual job workload.
- Avoid scanning a large `_temp` tree with `find` + `stat` on every hook invocation, especially when `_temp` is populated by common actions like `actions/setup-go@v5`.
- Avoid doing heavy, full-directory pod-to-pod copies over the Kubernetes API for `_temp` when it contains hundreds of MB and many files.
Ideally, the implementation would do one of the following:
- Limit the scan/copy to a smaller, dedicated directory,
- Run such checks less frequently,
- Or provide a way to disable / relax these disk-usage and sync operations when they become too expensive.
In practice, for this job I would expect:
- The total runtime to be dominated by `go test`, uploads, and reporting logic.
- No tens-of-minutes hangs around `upload-artifact` or `dorny/test-reporter`.
- No need to manually clean `_temp` inside the pods as a workaround.
- No full `_temp` copy/sync between runner pod and workflow pod on each hook invocation when `_temp` is large.
Additional Context
NA
Controller Logs
Runner Pod Logs