We are attempting to adopt v0.8.0, but we are running into some issues.
In the examples below, we are using the main branch of the runner-container-hooks project, because it contains fixes that are not yet released (e.g. ad9cb43 and 2934de3).
Copy mechanism seems to be flaky
We are noticing that the copy mechanism is not consistent: very often we see a hash mismatch in the logs.
```
##[debug]Copying from job pod 'default-staging-wpn46-runner-dqrjw-workflow' /__w/_temp to /home/runner/_work/_temp
##[debug]Copying from pod default-staging-wpn46-runner-dqrjw-workflow /__w/_temp to /home/runner/_work/_temp
##[debug]internalExecOutput response: {"metadata":{},"status":"Success"}
##[debug]The hash of the directory does not match the expected value; want='e2412fb6f1a17893780dc33b1171cf815588b606687149a59a3fa3f5dd2eeb21' got='1a5e6d0abd2ea4bcb8ff8e895dcddd125583aa238d576c2def934ff269f43c3b'
##[debug]The hash of the directory does not match the expected value; want='e2412fb6f1a17893780dc33b1171cf815588b606687149a59a3fa3f5dd2eeb21' got='29da1d24c4c060aed2b19c1a0d5e2f45938c33c52ed8b9a2900cff2c339f5400'
##[debug]The hash of the directory does not match the expected value; want='e2412fb6f1a17893780dc33b1171cf815588b606687149a59a3fa3f5dd2eeb21' got='29da1d24c4c060aed2b19c1a0d5e2f45938c33c52ed8b9a2900cff2c339f5400'
##[debug]The hash of the directory does not match the expected value; want='e2412fb6f1a17893780dc33b1171cf815588b606687149a59a3fa3f5dd2eeb21' got='29da1d24c4c060aed2b19c1a0d5e2f45938c33c52ed8b9a2900cff2c339f5400'
##[debug]The hash of the directory does not match the expected value; want='e2412fb6f1a17893780dc33b1171cf815588b606687149a59a3fa3f5dd2eeb21' got='29da1d24c4c060aed2b19c1a0d5e2f45938c33c52ed8b9a2900cff2c339f5400'
```
Looking at the code, we can see that it lists all files, sorts the list, and then computes a hash on both the runner and the workflow pod. If the hashes do not match, the warning above is logged and the copy is retried (up to 15 times).
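To make the failure mode concrete, here is a minimal sketch of that kind of check (helper names are hypothetical; the real implementation lives in runner-container-hooks): list every path under a directory, sort the listing, and hash it, so any file present on one side but not the other changes the digest and triggers a retry.

```typescript
import { createHash } from "crypto";
import { readdirSync, writeFileSync, mkdtempSync } from "fs";
import { join, relative } from "path";
import { tmpdir } from "os";

// Recursively list all paths under `root`, relative to it.
function listPaths(root: string): string[] {
  const out: string[] = [];
  const walk = (dir: string) => {
    for (const entry of readdirSync(dir, { withFileTypes: true })) {
      const full = join(dir, entry.name);
      out.push("./" + relative(root, full));
      if (entry.isDirectory()) walk(full);
    }
  };
  walk(root);
  return out;
}

// Hash the sorted listing; run on both pods and compare the digests.
export function hashDirListing(root: string): string {
  const listing = listPaths(root).sort().join("\n");
  return createHash("sha256").update(listing).digest("hex");
}

// Demo: two directories with identical contents agree; one extra file
// (e.g. a save_state file written mid-copy) breaks the comparison.
const a = mkdtempSync(join(tmpdir(), "pod-a-"));
const b = mkdtempSync(join(tmpdir(), "pod-b-"));
writeFileSync(join(a, "step.sh"), "");
writeFileSync(join(b, "step.sh"), "");
console.log(hashDirListing(a) === hashDirListing(b)); // true
writeFileSync(join(b, "save_state"), "");
console.log(hashDirListing(a) === hashDirListing(b)); // false
```

Note that any file created or deleted between the two listings (state files, temp scripts) will make the digests disagree even when the copy itself succeeded, which matches the intermittent mismatches we see.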
After adding some extra debug statements, we could see which files were missing.
Here are some files that were present on the runner pod, but not on the workflow pod.
```
./_runner_file_commands/save_state_f696e750-d1dc-4caf-bda4-0cf0c008127e
./_runner_file_commands
./53e43c7f-7461-4d9c-87d3-2579590c5f4d/lib/node_modules/npm/node_modules/qrcode-terminal/example/basic.png
./_runner_file_commands/save_state_f696e750-d1dc-4caf-bda4-0cf0c008127e
```
Here are some files that were present on the workflow pod, but not on the runner pod.
```
./_runner_file_commands
./tm-pnt-dummy/tm-pnt-dummy # -> this is our checked out git repository, so potentially a big issue
./fe7da619-91fd-4916-94dd-ed1d83a259b5.sh
./_runner_file_commands/set_output_f696e750-d1dc-4caf-bda4-0cf0c008127e
```
The missing files were not the same on every re-run of the workflow, which suggests that the way files are copied over is flaky.
I noticed the question has also been raised whether this hash check could be dropped, since the process continues anyway after the check fails 15 times. That does not seem like a good idea, because missing files can have big consequences. Suppose a specific artifact is not copied over to the next step, and that next step builds your release tarball. The tarball could end up useless because it is missing an artifact, and worse, you would not even know about it.
Copy mechanism uses considerable resources
A second downside of the copy mechanism is that it consumes considerable resources.
We expected all workflows to run in a workflow pod, with the runner pod only responsible for creating and deleting the workflow and step pods. We therefore set rather low memory requests and limits on our runner pod.
However, we noticed that our runner pod was getting OOMKilled. After investigating, we found that for the example workflow below, the runner pod used almost a full CPU and 445 MiB of RAM, just to copy files to and from the workflow pod.
```
❯ kubectl top pods -n gha-runner-scale-sets
NAME                                          CPU(cores)   MEMORY(bytes)
default-staging-wpn46-runner-dqrjw            923m         445Mi
default-staging-wpn46-runner-dqrjw-workflow   1m           174Mi
```
You can use the workflow below to reproduce the issues listed above.
```yaml
name: Showcase copy issues
on:
  workflow_dispatch:
jobs:
  test:
    runs-on: default-staging
    container:
      image: ubuntu:latest
    steps:
      - name: Install git
        run: |
          apt-get update
          apt-get install -y git
      - name: Checkout repository
        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
      - name: Set git safe directory
        shell: bash
        run: git config --global --add safe.directory "$GITHUB_WORKSPACE"
      - name: Setup Node.js
        uses: actions/setup-node@2028fbc5c25fe9cf00d9f06a71cc4710d4507903 # v6.0.0
        with:
          node-version: 24
      - name: Semantic Release
        id: semantic
        uses: cycjimmy/semantic-release-action@ba330626c4750c19d8299de843f05c7aa5574f62 # v5.0.2
        with:
          extra_plugins: |
            conventional-changelog-conventionalcommits
```
After spending some time with v0.8.0 over the last weeks, I am wondering why the project moved away from using volumes in favour of the copy mechanism.
Creating a volume once and mounting it into every workflow and step pod seems like a more elegant solution than the copy mechanism used today.
I also think a volume has further advantages over the copy mechanism.
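To illustrate the volume-based approach (a sketch only; the function and claim names are hypothetical, and this is not how the hook currently builds pod specs): every workflow and step pod would mount the same PersistentVolumeClaim at the working directory, so nothing ever needs to be copied or hash-checked.

```typescript
// Hypothetical sketch: one PVC shared by all workflow/step pods,
// replacing the copy-plus-hash-check mechanism.
interface VolumeMount { name: string; mountPath: string }
interface PodSpecSketch {
  volumes: { name: string; persistentVolumeClaim: { claimName: string } }[];
  containers: { name: string; image: string; volumeMounts: VolumeMount[] }[];
}

export function workflowPodWithSharedWork(claimName: string): PodSpecSketch {
  return {
    volumes: [{ name: "work", persistentVolumeClaim: { claimName } }],
    containers: [
      {
        name: "workflow",
        image: "ubuntu:latest",
        // Mounted at the same path the hook uses inside the workflow pod.
        volumeMounts: [{ name: "work", mountPath: "/__w" }],
      },
    ],
  };
}

const spec = workflowPodWithSharedWork("runner-dqrjw-work");
console.log(spec.containers[0].volumeMounts[0].mountPath); // "/__w"
```

The trade-off is that the PVC's access mode (e.g. ReadWriteMany when runner and workflow pods can land on different nodes) becomes a cluster requirement, which may be why the project chose copying; if so, offering volumes as an opt-in would still help setups like ours.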
If we for some reason need to keep using the copy mechanism, would it be possible to also provide the option to use volumes instead?
I know I could create a fork of 0.7.0 and implement #235, but I would rather follow the upstream project.
Thanks in advance!