Copy mechanism does not seem to be consistent #275

@vvanouytsel-trendminer

Description

We are attempting to adopt v0.8.0, but we are running into some issues.

In the examples below, we are using the main branch of the runner-container-hooks project, because some issues are potentially fixed there but not yet released (e.g. ad9cb43 and 2934de3).

Copy mechanism seems to be flaky

We are noticing that the copy mechanism is not consistent: very often we see a hash mismatch in the logs.

##[debug]Copying from job pod 'default-staging-wpn46-runner-dqrjw-workflow' /__w/_temp to /home/runner/_work/_temp
##[debug]Copying from pod default-staging-wpn46-runner-dqrjw-workflow /__w/_temp to /home/runner/_work/_temp
##[debug]internalExecOutput response: {"metadata":{},"status":"Success"}
##[debug]The hash of the directory does not match the expected value; want='e2412fb6f1a17893780dc33b1171cf815588b606687149a59a3fa3f5dd2eeb21' got='1a5e6d0abd2ea4bcb8ff8e895dcddd125583aa238d576c2def934ff269f43c3b'
##[debug]The hash of the directory does not match the expected value; want='e2412fb6f1a17893780dc33b1171cf815588b606687149a59a3fa3f5dd2eeb21' got='29da1d24c4c060aed2b19c1a0d5e2f45938c33c52ed8b9a2900cff2c339f5400'
##[debug]The hash of the directory does not match the expected value; want='e2412fb6f1a17893780dc33b1171cf815588b606687149a59a3fa3f5dd2eeb21' got='29da1d24c4c060aed2b19c1a0d5e2f45938c33c52ed8b9a2900cff2c339f5400'
##[debug]The hash of the directory does not match the expected value; want='e2412fb6f1a17893780dc33b1171cf815588b606687149a59a3fa3f5dd2eeb21' got='29da1d24c4c060aed2b19c1a0d5e2f45938c33c52ed8b9a2900cff2c339f5400'
##[debug]The hash of the directory does not match the expected value; want='e2412fb6f1a17893780dc33b1171cf815588b606687149a59a3fa3f5dd2eeb21' got='29da1d24c4c060aed2b19c1a0d5e2f45938c33c52ed8b9a2900cff2c339f5400'

Looking at the code, we can see that it lists all files, sorts them, and computes a hash on both the runner and workflow pods. If the hashes do not match, the warning above is logged and the copy is retried (up to 15 times).
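The idea of that check can be sketched roughly as follows. This is a simplified shell illustration, not the hooks' actual TypeScript implementation; the helper name and paths are made up. It also shows why the check is inherently racy: any file written into the directory between the two hash computations (such as the `save_state_*` runner file commands seen below) changes the result.

```shell
#!/bin/sh
# Hypothetical sketch of the consistency check: hash the sorted file
# listing of a directory, then compare the hashes taken on both sides.
hash_dir() {
  (cd "$1" && find . | LC_ALL=C sort | sha256sum | cut -d' ' -f1)
}

src=$(mktemp -d)
touch "$src/a" "$src/b"
want=$(hash_dir "$src")            # hash taken on one side of the copy
touch "$src/save_state_example"    # a file written mid-copy, e.g. a runner file command
got=$(hash_dir "$src")             # hash taken on the other side
[ "$want" = "$got" ] || echo "hash mismatch"   # prints "hash mismatch"
```

Because the directory keeps changing while the job runs, two snapshots taken at different moments can legitimately differ, which would explain why the retries keep failing.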

After adding some extra debug statements, we could see which files were missing.

Here are some files that were present on the runner pod, but not on the workflow pod.

./_runner_file_commands/save_state_f696e750-d1dc-4caf-bda4-0cf0c008127e
./_runner_file_commands
./53e43c7f-7461-4d9c-87d3-2579590c5f4d/lib/node_modules/npm/node_modules/qrcode-terminal/example/basic.png
./_runner_file_commands/save_state_f696e750-d1dc-4caf-bda4-0cf0c008127e

Here are some files that were present on the workflow pod, but not on the runner pod.

./_runner_file_commands
./tm-pnt-dummy/tm-pnt-dummy  # -> this is our checked out git repository, so potentially a big issue 
./fe7da619-91fd-4916-94dd-ed1d83a259b5.sh
./_runner_file_commands/set_output_f696e750-d1dc-4caf-bda4-0cf0c008127e

The missing files were not the same on every re-run of the workflow, meaning the way files are copied over is flaky.

I noticed it was also asked whether we could get rid of this hash check, since the process continues anyway after the check fails 15 times. That does not seem like a good idea, because missing files can have big consequences. Suppose a specific artifact is not copied over to the next step, and that next step creates your release tarball. The tarball could end up completely useless because it is missing an artifact, and worse, you would not even know about it.

Copy mechanism uses quite some resources

A second downside of the copy mechanism is that it uses quite a lot of resources.
We expected all workflows to run in a workflow pod, with the runner pod only responsible for creating/deleting the workflow and step pods. We therefore set rather low memory requests and limits on our runner pod.

However, we noticed that our runner pod was getting OOMKilled. After investigating, we found that for the example workflow below, the runner pod used almost 1 CPU and 445 MiB of RAM, purely from copying files to and from the workflow pod.

 ❯ kubectl top pods -n gha-runner-scale-sets 
NAME                                          CPU(cores)   MEMORY(bytes)   
default-staging-wpn46-runner-dqrjw            923m         445Mi           
default-staging-wpn46-runner-dqrjw-workflow   1m           174Mi 

You can use the workflow below to reproduce the issues listed above.

name: Showcase copy issues
on:
  workflow_dispatch:
jobs:
  test:
    runs-on: default-staging
    container:
      image: ubuntu:latest
    steps:
      - name: Install git
        run: |
          apt-get update
          apt-get install -y git
          
      - name: Checkout repository
        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 #v5.0.0

      - name: Set git safe directory
        shell: bash
        run: git config --global --add safe.directory "$GITHUB_WORKSPACE"

      - name: Setup Node.js
        uses: actions/setup-node@2028fbc5c25fe9cf00d9f06a71cc4710d4507903 #v6.0.0
        with:
          node-version: 24

      - name: Semantic Release
        id: semantic
        uses: cycjimmy/semantic-release-action@ba330626c4750c19d8299de843f05c7aa5574f62 #v5.0.2
        with:
          extra_plugins: |
            conventional-changelog-conventionalcommits

After spending some time with v0.8.0 over the last few weeks, I am wondering why we stepped away from using volumes in favor of the copy mechanism.

Creating the volume once and mounting it into all workflow and step pods seems like a more elegant solution than the copy mechanism used today.
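As an illustration of the idea (a hypothetical sketch, not an existing hooks configuration; all names and paths are made up), a single shared claim could back the working directory of every pod, so nothing would ever need to be copied:

```yaml
# Hypothetical: one claim shared by the runner, workflow, and step pods.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: work-volume
spec:
  accessModes: ["ReadWriteMany"]   # assumes a storage class that supports RWX
  resources:
    requests:
      storage: 1Gi
---
# ...and in each pod spec:
#   volumes:
#     - name: work
#       persistentVolumeClaim:
#         claimName: work-volume
#   containers:
#     - volumeMounts:
#         - name: work
#           mountPath: /__w
```

With a shared mount, the consistency problem disappears by construction: both pods see the same files, so there is no snapshot to drift out of sync and no hash check to retry.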

I think there are further advantages to using a volume over the copy mechanism.
If we need to keep the copy mechanism for some reason, would it be possible to also offer volumes as an option?

I know I could create a fork of 0.7.0 and implement #235, but I would rather follow the upstream project.

Thanks in advance!
