
Add option to disable pod affinity #235

Closed

Wielewout wants to merge 2 commits into actions:main from Wielewout:optional-pod-affinity

Conversation

@Wielewout
Contributor

In #212 pod affinity was added when the kube scheduler is enabled. While a much better default, it makes less optimal use of resources in the cluster (and can even break some setups, see #201 (comment)).

By default pod affinity remains set, but setting ACTIONS_RUNNER_USE_POD_AFFINITY=false skips the pod affinity rules.

When it is disabled, the runner and workflow pod can be scheduled on different nodes again. It is then up to the user to provide RWX volumes in the cluster, a node selector for architecture when using a multi-arch cluster (on both the runner and workflow pod so they match), and so on.
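To make the behaviour concrete, here is a minimal sketch of how the hook could gate the affinity on this variable. The helper names, label key, and pod-spec handling are illustrative assumptions, not the actual diff in this PR:

```typescript
// Sketch only: names and the label selector below are illustrative assumptions.
import type { V1Affinity, V1PodSpec } from '@kubernetes/client-node'

// Anything other than an explicit "false" keeps the current default of
// co-scheduling the workflow pod with the runner pod.
function usePodAffinity(): boolean {
  return (process.env.ACTIONS_RUNNER_USE_POD_AFFINITY ?? 'true').toLowerCase() !== 'false'
}

// Pin the workflow pod to the node running the runner pod by matching a
// label on the runner pod (hypothetical label key for this sketch).
function buildRunnerPodAffinity(runnerPodName: string): V1Affinity {
  return {
    podAffinity: {
      requiredDuringSchedulingIgnoredDuringExecution: [
        {
          labelSelector: {
            matchExpressions: [
              { key: 'runner-pod-name', operator: 'In', values: [runnerPodName] }
            ]
          },
          topologyKey: 'kubernetes.io/hostname'
        }
      ]
    }
  }
}

// Attach the affinity only when the opt-out variable is not set to "false".
function applyPodAffinity(spec: V1PodSpec, runnerPodName: string): void {
  if (usePodAffinity()) {
    spec.affinity = buildRunnerPodAffinity(runnerPodName)
  }
}
```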

@zchenyu

zchenyu commented Jul 16, 2025

Thanks! fwiw, I had the exact same implementation in my fork :)

@gigabyte132

gigabyte132 commented Aug 6, 2025

This would help us a lot in our setup as well; without it, our GPU runners break whenever no GPUs are available on the node where the job pod runs (even though GPUs are available on other nodes).

@oed-lipphausent

This is exactly what we need! :D

Currently, we have the problem that the runner pods are created as expected, but we always have to wait for the workflow pods. Since the introduction of node affinity, our nodes do not have enough resources to process every workflow pod for every runner pod.
This is why we originally switched to the Kube Scheduler, but since the change, we have the problem again.

The change with node affinity makes using the Kube Scheduler pointless for us. We deliberately chose this path so that we could use smaller nodes and scale the number of nodes to save costs.

However, since this PR has been open for more than a month, I wonder if it is realistic to expect this change to be considered in the near future.

I don't know who is responsible for this, maybe @nikola-jokic, but it would be nice to see some feedback on this PR.

@nikola-jokic
Collaborator

Hey everyone, we are currently working on a PR that will disable the affinity and volume mounts completely.

@oed-lipphausent

@nikola-jokic, thank you for the update. I didn't know you were working on something like this, and I'm very excited to see how it develops. I think it's a nice idea.

@jennyluciav

@nikola-jokic, thanks for working on this. Do you have a rough timeline for when the fix will be implemented?

@nikola-jokic
Collaborator

Hey @jennyluciav,

The target date is Oct 13th, but it might be sooner. Most of the work has been done, and I'd love to test it a bit more to make sure everything works at least for most cases. I'm a bit worried about permissions: with volume mounts we could easily mount into any directory, while with copy there are certain folders where it might result in an error, but that mostly affects user volume mounts, which are likely not frequent.
There are a few things we are working on right now, not just this feature, but if we finish that work sooner, we will issue a release as soon as we are done and will not wait for the target date.

@LeonoreMangold

LeonoreMangold commented Oct 6, 2025

@nikola-jokic @Wielewout Why was this pull request closed? Is the work being moved to another PR? I also have this problem: I'm using an RWX work volume and I need the runner and workflow pods to be able to schedule to different nodes to fit the resource requests they are given.

@Wielewout
Contributor Author

Wielewout commented Oct 6, 2025

@nikola-jokic @Wielewout Why was this pull request closed? Is the work being moved to another PR?

This PR became obsolete because of #244. Instead of using a volume in both the runner and workflow pod, the runner will copy files to the workflow pod. This also removes the requirement to keep both pods on the same node.
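As a rough illustration of the copy-based approach (not the actual #244 implementation; the namespace, pod name, container name, and paths below are assumptions), copying the workspace with the Kubernetes JavaScript client could look like this:

```typescript
import * as k8s from '@kubernetes/client-node'

// Sketch only: all names and paths are illustrative assumptions.
async function copyWorkspaceToWorkflowPod(): Promise<void> {
  const kc = new k8s.KubeConfig()
  kc.loadFromDefault()

  const cp = new k8s.Cp(kc)
  // cpToPod streams a tar of the local directory into the target container,
  // so the runner and workflow pods no longer need a shared RWX volume or
  // affinity rules that keep them on the same node.
  await cp.cpToPod(
    'arc-runners',        // namespace (assumed)
    'workflow-pod',       // workflow pod name (assumed)
    'job',                // container name (assumed)
    '/home/runner/_work', // source path on the runner
    '/__w'                // target path in the workflow pod (assumed)
  )
}
```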

@vvanouytsel
Contributor

Did anyone get #244 in a working state?
When using container-hooks version 0.8.0 my Initialize Containers step always fails with:

```
Run '/home/runner/k8s/index.js'
(node:66) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
Error: Error: cpToPod failed after 30 attempts: {}
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
```

@vvanouytsel
Contributor

I still think an option to disable nodeAffinity should be added.

The solution in #244 does not support all use cases.

For example:

Any chance we can get this PR reopened?

cc @Wielewout @nikola-jokic

@nikola-jokic
Collaborator

Hey @vvanouytsel, we can't keep two parallel versions of the hook at the same time. Since the 0.8.0 implementation is not heavily tested in every environment, we include both versions on the runner, so you can fall back to the 0.7.0 version of the hook.
If you are blocked, you can rebuild the hook and put your own version inside the container.
In the meantime, we will be fixing the reported issues in the 0.8.0 version, which should eventually fully replace the 0.7.0 version.
