We are attempting to adopt v0.8.0, but we are running into some issues.
In the examples below, we are using the main branch of the runner-container-hooks project, because it contains fixes that are not yet released (e.g. ad9cb43 and 2934de3).
Copy mechanism seems to be flaky
We are noticing that the copy mechanism is not consistent: very often we see a hash mismatch in the logs.
```
##[debug]Copying from job pod 'default-staging-wpn46-runner-dqrjw-workflow' /__w/_temp to /home/runner/_work/_temp
##[debug]Copying from pod default-staging-wpn46-runner-dqrjw-workflow /__w/_temp to /home/runner/_work/_temp
##[debug]internalExecOutput response: {"metadata":{},"status":"Success"}
##[debug]The hash of the directory does not match the expected value; want='e2412fb6f1a17893780dc33b1171cf815588b606687149a59a3fa3f5dd2eeb21' got='1a5e6d0abd2ea4bcb8ff8e895dcddd125583aa238d576c2def934ff269f43c3b'
##[debug]The hash of the directory does not match the expected value; want='e2412fb6f1a17893780dc33b1171cf815588b606687149a59a3fa3f5dd2eeb21' got='29da1d24c4c060aed2b19c1a0d5e2f45938c33c52ed8b9a2900cff2c339f5400'
##[debug]The hash of the directory does not match the expected value; want='e2412fb6f1a17893780dc33b1171cf815588b606687149a59a3fa3f5dd2eeb21' got='29da1d24c4c060aed2b19c1a0d5e2f45938c33c52ed8b9a2900cff2c339f5400'
##[debug]The hash of the directory does not match the expected value; want='e2412fb6f1a17893780dc33b1171cf815588b606687149a59a3fa3f5dd2eeb21' got='29da1d24c4c060aed2b19c1a0d5e2f45938c33c52ed8b9a2900cff2c339f5400'
##[debug]The hash of the directory does not match the expected value; want='e2412fb6f1a17893780dc33b1171cf815588b606687149a59a3fa3f5dd2eeb21' got='29da1d24c4c060aed2b19c1a0d5e2f45938c33c52ed8b9a2900cff2c339f5400'
```
Looking at the code, we can see that it lists all files, sorts the list, and then computes a hash on both the runner and the workflow pod. If the hashes do not match, the warning above is logged and the copy is retried (up to 15 times).
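To make the failure mode concrete, here is a minimal sketch of that kind of check (helper names are hypothetical; the real implementation lives in runner-container-hooks): list every path under a directory, sort the listing, and hash it, so any file present on one side but not the other changes the digest and triggers a retry.

```typescript
import { createHash } from "crypto";
import { readdirSync, writeFileSync, mkdtempSync } from "fs";
import { join, relative } from "path";
import { tmpdir } from "os";

// Recursively list all paths under `root`, relative to it.
function listPaths(root: string): string[] {
  const out: string[] = [];
  const walk = (dir: string) => {
    for (const entry of readdirSync(dir, { withFileTypes: true })) {
      const full = join(dir, entry.name);
      out.push("./" + relative(root, full));
      if (entry.isDirectory()) walk(full);
    }
  };
  walk(root);
  return out;
}

// Hash the sorted listing; run on both pods and compare the digests.
export function hashDirListing(root: string): string {
  const listing = listPaths(root).sort().join("\n");
  return createHash("sha256").update(listing).digest("hex");
}

// Demo: two directories with identical contents agree; one extra file
// (e.g. a save_state file written mid-copy) breaks the comparison.
const a = mkdtempSync(join(tmpdir(), "pod-a-"));
const b = mkdtempSync(join(tmpdir(), "pod-b-"));
writeFileSync(join(a, "step.sh"), "");
writeFileSync(join(b, "step.sh"), "");
console.log(hashDirListing(a) === hashDirListing(b)); // true
writeFileSync(join(b, "save_state"), "");
console.log(hashDirListing(a) === hashDirListing(b)); // false
```

Note that any file created or deleted between the two listings (state files, temp scripts) will make the digests disagree even when the copy itself succeeded, which matches the intermittent mismatches we see.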
After adding some extra debug statements, we could see which files were missing.
Here are some files that were present on the runner pod, but not on the workflow pod.
```
./_runner_file_commands/save_state_f696e750-d1dc-4caf-bda4-0cf0c008127e
./_runner_file_commands
./53e43c7f-7461-4d9c-87d3-2579590c5f4d/lib/node_modules/npm/node_modules/qrcode-terminal/example/basic.png
./_runner_file_commands/save_state_f696e750-d1dc-4caf-bda4-0cf0c008127e
```
Here are some files that were present on the workflow pod, but not on the runner pod.
```
./_runner_file_commands
./tm-pnt-dummy/tm-pnt-dummy # -> this is our checked out git repository, so potentially a big issue
./fe7da619-91fd-4916-94dd-ed1d83a259b5.sh
./_runner_file_commands/set_output_f696e750-d1dc-4caf-bda4-0cf0c008127e
```
The missing files were not the same on every re-run of the workflow, which suggests that the way files are copied over is flaky.
I noticed the question has also been raised whether this hash check could be dropped, since the process continues anyway after the check fails 15 times. That does not seem like a good idea, because missing files can have big consequences. Suppose a specific artifact is not copied over to the next step, and that next step builds your release tarball. The tarball could end up useless because it is missing an artifact, and worse, you would not even know about it.
Copy mechanism uses considerable resources
A second downside of the copy mechanism is that it consumes considerable resources.
We expected all workflows to run in a workflow pod, with the runner pod only responsible for creating and deleting the workflow and step pods. We therefore set rather low memory requests and limits on our runner pod.
However, we noticed that our runner pod was getting OOMKilled. After investigating, we found that for the example workflow below, the runner pod used almost a full CPU and 445 MiB of RAM, just to copy files to and from the workflow pod.
```
❯ kubectl top pods -n gha-runner-scale-sets
NAME                                          CPU(cores)   MEMORY(bytes)
default-staging-wpn46-runner-dqrjw            923m         445Mi
default-staging-wpn46-runner-dqrjw-workflow   1m           174Mi
```
You can use the workflow below to reproduce the issues listed above.
```yaml
name: Showcase copy issues
on:
  workflow_dispatch:
jobs:
  test:
    runs-on: default-staging
    container:
      image: ubuntu:latest
    steps:
      - name: Install git
        run: |
          apt-get update
          apt-get install -y git
      - name: Checkout repository
        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
      - name: Set git safe directory
        shell: bash
        run: git config --global --add safe.directory "$GITHUB_WORKSPACE"
      - name: Setup Node.js
        uses: actions/setup-node@2028fbc5c25fe9cf00d9f06a71cc4710d4507903 # v6.0.0
        with:
          node-version: 24
      - name: Semantic Release
        id: semantic
        uses: cycjimmy/semantic-release-action@ba330626c4750c19d8299de843f05c7aa5574f62 # v5.0.2
        with:
          extra_plugins: |
            conventional-changelog-conventionalcommits
```
After spending some time with v0.8.0 over the last weeks, I am wondering why the project moved away from using volumes in favour of the copy mechanism.
Creating a volume once and mounting it into every workflow and step pod seems like a more elegant solution than the copy mechanism used today.
I also think a volume has further advantages over the copy mechanism.
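To illustrate the volume-based approach (a sketch only; the function and claim names are hypothetical, and this is not how the hook currently builds pod specs): every workflow and step pod would mount the same PersistentVolumeClaim at the working directory, so nothing ever needs to be copied or hash-checked.

```typescript
// Hypothetical sketch: one PVC shared by all workflow/step pods,
// replacing the copy-plus-hash-check mechanism.
interface VolumeMount { name: string; mountPath: string }
interface PodSpecSketch {
  volumes: { name: string; persistentVolumeClaim: { claimName: string } }[];
  containers: { name: string; image: string; volumeMounts: VolumeMount[] }[];
}

export function workflowPodWithSharedWork(claimName: string): PodSpecSketch {
  return {
    volumes: [{ name: "work", persistentVolumeClaim: { claimName } }],
    containers: [
      {
        name: "workflow",
        image: "ubuntu:latest",
        // Mounted at the same path the hook uses inside the workflow pod.
        volumeMounts: [{ name: "work", mountPath: "/__w" }],
      },
    ],
  };
}

const spec = workflowPodWithSharedWork("runner-dqrjw-work");
console.log(spec.containers[0].volumeMounts[0].mountPath); // "/__w"
```

The trade-off is that the PVC's access mode (e.g. ReadWriteMany when runner and workflow pods can land on different nodes) becomes a cluster requirement, which may be why the project chose copying; if so, offering volumes as an opt-in would still help setups like ours.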
If we for some reason need to keep using the copy mechanism, would it be possible to also provide the option to use volumes instead?
I know I could create a fork of 0.7.0 and implement #235, but I would rather follow the upstream project.
Thanks in advance!