Severe slowdown in k8s-novolume hooks due to _temp find/stat scan with actions/setup-go #4313

@mcammisa78

Controller Version

0.13.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Deploy `gha-runner-scale-set` with:
   - Controller version: `0.13.0`
   - Runner type: `kubernetes` with `type: novolume`
   - `ACTIONS_RUNNER_CONTAINER_HOOKS=/home/runner/k8s-novolume/index.js`
   - `k8s-novolume` hooks version: `0.13.0`
   - Overlay filesystem, no PVC, only ephemeral storage.

2. Create a workflow that (see the YAML sketch after this list):
   - Uses `actions/checkout@v4`
   - Uses `actions/setup-go@v5` with `go-version-file` pointing to `go.mod`
   - Runs `go test` with coverage
   - Uploads artifacts with `actions/upload-artifact@v4`
   - Publishes JUnit results with `dorny/test-reporter@v2`

3. Trigger the workflow on a Go repository. In our case this is driven by a composite action that:
   - Installs some system dependencies
   - Runs `go test -v ./...` with coverage
   - Generates `coverage.out`, HTML/text coverage reports, and `reports/junit.xml`
   - Uploads them via `upload-artifact` and `dorny/test-reporter`

4. Observe the job logs:
   - For almost every step, you see:
     `Run '/home/runner/k8s-novolume/index.js'`
     followed by Node’s Buffer deprecation warning.
   - Around `actions/upload-artifact@v4` and `dorny/test-reporter@v2`, the job appears to hang for tens of minutes.

5. While the job is “stuck”, exec into:
   - The **workflow pod** and run `ps -ef`:
     you see `cd /__w/_temp && find . ... -exec stat ...`.
   - The **runner pod** and run `ps -ef`:
     you see `cd /home/runner/_work/_temp && find . ... -exec stat ...`.

6. Check the size of `_temp`:
   - In the workflow pod:
     - `du -sh /__w/_temp` reports ~309M
     - `find /__w/_temp | wc -l` reports ~14026 files
   - The disk itself is fast (e.g. `dd` shows hundreds of MB/s).

7. While a step is slow (e.g. around `upload-artifact` / `test-reporter`):
   - Manually delete the Go setup temp directory under `_temp` in **both** pods, e.g.:
     - Runner pod: `rm -rf /home/runner/_work/_temp/<go-setup-guid-folder>`
     - Workflow pod: `rm -rf /__w/_temp/<go-setup-guid-folder>`

8. After removing those directories, observe:
   - The currently-running step completes within minutes instead of tens of minutes.
   - Remaining steps also complete quickly.
   - Overall job runtime drops dramatically.
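
For reference, here is a minimal sketch of the workflow from step 2. The runner label and artifact/reporter settings are placeholders, and the JUnit-XML generation done by our composite action is omitted:

```yaml
name: repro-novolume-slowdown
on: workflow_dispatch

jobs:
  test:
    # Placeholder label for a gha-runner-scale-set in kubernetes/novolume mode
    runs-on: my-novolume-runner-set
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version-file: go.mod   # extracts the Go toolchain under _temp/<guid>
      - run: go test -v -coverprofile=coverage.out ./...
      - uses: actions/upload-artifact@v4
        with:
          name: coverage
          path: coverage.out
      - uses: dorny/test-reporter@v2
        if: always()
        with:
          name: go-tests
          path: reports/junit.xml
          reporter: java-junit
```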

Describe the bug

When using gha-runner-scale-set (controller version 0.13.0) with:

  • Runner type: kubernetes + type: novolume
  • ACTIONS_RUNNER_CONTAINER_HOOKS=/home/runner/k8s-novolume/index.js
  • k8s-novolume hooks version: 0.13.0

we see severe slowdowns whenever _temp becomes non-trivial in size.

For every step, the container hook runs:

  • In the workflow pod:
    • cd /__w/_temp && find . -not -path '*/_runner_hook_responses*' -exec stat -c '%b %n' {} \;
  • In the runner pod:
    • cd /home/runner/_work/_temp && find . -not -path '*/_runner_hook_responses*' -exec stat -c '%b %n' {} \;
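
To see where the time goes, you can re-run the same scan by hand and time it. Note that the `-exec stat ... \;` form forks one `stat` process per file, so ~14k files means ~14k short-lived processes per hook invocation:

```bash
# Time the hook's disk-usage scan manually (workflow pod paths).
cd /__w/_temp
time find . -not -path '*/_runner_hook_responses*' -exec stat -c '%b %n' {} \; > /dev/null
```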

With Go workflows that use actions/setup-go@v5, _temp grows to ~300MB and ~14k files because the Go toolchain is extracted to /__w/_temp/<guid>/... and then cached.

Once _temp reaches that size, each invocation of index.js spends a long time scanning the entire _temp tree via find + stat. As a result:

  • Steps like actions/upload-artifact@v4 and dorny/test-reporter@v2 appear to hang for tens of minutes.
  • The overall job runtime grows to 50+ minutes, even though the actual go test work finishes much earlier.

The underlying disk is not the bottleneck (tested with dd, showing ~700MB/s writes and ~300MB/s reads). The expensive part is the repeated full scan of _temp from the hook.
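
For completeness, this is the kind of dd check we mean (flags and sizes here are illustrative, not the exact commands we ran):

```bash
# Illustrative throughput check; conv=fdatasync forces data to disk before
# dd reports a write rate.
dd if=/dev/zero of=/__w/_temp/ddtest bs=1M count=1024 conv=fdatasync
dd if=/__w/_temp/ddtest of=/dev/null bs=1M
rm -f /__w/_temp/ddtest
```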

In addition to that, novolume mode seems to synchronize the contents of _temp between the runner pod and the workflow pod. Both pods show mirrored contents under:

  • /home/runner/_work/_temp (runner pod)
  • /__w/_temp (workflow pod)

This suggests that some kind of pod-to-pod copy over the Kubernetes API is happening; in effect it resembles a `kubectl cp`-style tar stream. When _temp contains hundreds of MB and thousands of files, this cross-pod sync likely adds even more overhead on top of the find/stat scan, further amplifying the slowdown.
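
For illustration, a `kubectl cp`-style transfer between the two pods would look roughly like this (pod names are placeholders; we have not confirmed this is literally what the hook does):

```bash
# Stream _temp from the runner pod into the workflow pod over the K8s API.
kubectl exec <runner-pod> -- tar cf - -C /home/runner/_work/_temp . \
  | kubectl exec -i <workflow-pod> -- tar xf - -C /__w/_temp
```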

If we manually delete the Go-related temp directory under _temp in both pods while a step is slow, the step completes quickly and the rest of the workflow also runs fast. This strongly suggests that:

  1. The _temp disk-usage scan in k8s-novolume, and
  2. The implied copy/synchronization of that directory between the two pods

are the main causes of the slowdown for this scenario.
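
As a stopgap that mirrors the manual cleanup in step 7, a hypothetical workflow step placed right after `actions/setup-go` could delete the leftover GUID-named extraction directories. The directory pattern is an assumption, and since `$RUNNER_TEMP` resolves to `_temp` only in the pod where the step runs, this may not clear the mirrored copy in the other pod:

```yaml
- name: Trim setup-go extraction dirs from _temp (hypothetical workaround)
  shell: bash
  run: |
    # Remove top-level GUID-named directories left by the toolchain extraction.
    find "$RUNNER_TEMP" -mindepth 1 -maxdepth 1 -type d \
      -regextype posix-extended -regex '.*/[0-9a-fA-F-]{36}' \
      -exec rm -rf {} +
```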

Describe the expected behavior

I would expect the k8s-novolume hooks to:

  • Not introduce large overhead relative to the actual job workload.
  • Avoid scanning a large _temp tree with find + stat on every hook invocation, especially when _temp is populated by common actions like actions/setup-go@v5.
  • Avoid doing heavy, full-directory pod-to-pod copies over the Kubernetes API for _temp when it contains hundreds of MB and many files.

To avoid this, the implementation could:

  • Limit the scan/copy to a smaller, dedicated directory,
  • Run such checks less frequently,
  • Or provide a way to disable / relax these disk-usage and sync operations when they become too expensive.

In practice, for this job I would expect:

  • The total runtime to be dominated by go test, uploads, and reporting logic.
  • No tens-of-minutes hangs around upload-artifact or dorny/test-reporter.
  • No need to manually clean _temp inside the pods as a workaround.
  • No full _temp copy/sync between runner pod and workflow pod on each hook invocation when _temp is large.

Additional Context

NA

Controller Logs

NA

Runner Pod Logs

NA

Labels

bug, gha-runner-scale-set, needs triage
