Add ability to add annotations to Runner Pods once they start running a job #2562

@Nuru

What would you like added?

Currently, you can add annotations to every Pod in a RunnerDeployment by adding them to the RunnerDeployment Spec under

spec:
  template:
    metadata:
      annotations:

I would like the ability to specify annotations to be added to Pods at the time the Pods are assigned jobs, so that idle Pods waiting for jobs do not have the same annotations as Pods running jobs.
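
For reference, a complete manifest for the current behavior might look like the sketch below. The resource name and repository are placeholders, and the annotation is only one example of what might be set here. With this configuration, every Pod receives the annotation whether or not it is running a job.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runners                      # placeholder name
spec:
  template:
    metadata:
      annotations:
        # Applied to every Pod at creation time, idle or busy
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      repository: example-org/example-repo   # placeholder repository
```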

Why is this needed?

Kubernetes cluster autoscaling solutions generally expect that a Pod runs a service that can be terminated on one Node and restarted on another, with only a short grace period needed to finish processing in-flight requests. When the cluster is resized, the Cluster Autoscaler will do just that. However, GitHub Actions Runner jobs do not fit this model: if a Pod is terminated in the middle of a job, the job is lost. The likelihood of this happening is increased by the fact that the Actions Runner Controller autoscaler regularly expands and contracts the Runner pool, which causes the Cluster Autoscaler to scale the EKS cluster up or down more frequently and, consequently, to move Pods around.

In order to handle situations like this, cluster autoscalers typically allow Pods to indicate that they cannot be safely interrupted via an annotation. For the Kubernetes Cluster Autoscaler, you can add the annotation

"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"

For Karpenter, you can add the annotation

karpenter.sh/do-not-evict: "true"

An annotation like this should be added to Pods running jobs, so that the job can finish.

However, we do not want this annotation on idle Pods waiting for jobs. Otherwise, the Cluster Autoscaler would be prevented from removing nodes where the idle Pods are waiting, which is exactly the opposite of what we want.

The obvious solution is to have ARC add the annotation to a Pod once a job is assigned to it. In the case of persistent runners, ARC should also remove the annotation once the job is finished, so that the Pod becomes evictable again while idle.
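
To illustrate the requested behavior, the sketch below shows the metadata of the same Runner Pod in its two states. These are metadata fragments rather than full manifests, the Pod name is a placeholder, and the Cluster Autoscaler annotation is used only as the example. How ARC would be configured to do this is deliberately left open, since that is what this issue is asking for.

```yaml
# State 1: idle Pod waiting for a job.
# No eviction-blocking annotation, so its Node can still be scaled down.
metadata:
  name: example-runner-abcde     # placeholder Pod name
  annotations: {}
---
# State 2: the same Pod after a job has been assigned.
# The annotation is present for the duration of the job and, for
# persistent runners, would be removed again when the job finishes.
metadata:
  name: example-runner-abcde     # placeholder Pod name
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```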

Additional context

It is practically impossible to run very long jobs on a Runner that the Cluster Autoscaler can terminate and evict, unless the cluster's capacity is very stable. Currently, the only workable options are:

  1. Set minReplicas = 0 and add the annotation to all pods, solving the problem by never leaving idle Pods deployed
  2. Set up an ARC Autoscaler scheduled override to regularly drop minReplicas to zero to allow the Cluster Autoscaler to reclaim the Node(s) the idle Pod(s) are on

Neither option is desirable: (1) leaves no idle Runners to pick up jobs quickly, and (2) leaves long periods during which the Cluster Autoscaler is prevented from scaling down the cluster.
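
As a rough sketch, the two workarounds above could be expressed with a HorizontalRunnerAutoscaler along the following lines. Resource names, replica counts, and the override window are placeholders, scaling triggers are omitted for brevity, and the field names should be checked against the ARC version in use.

```yaml
# Workaround 1: never keep idle Pods around (minReplicas = 0),
# with the eviction-blocking annotation set on all Pods in the RunnerDeployment.
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runners-autoscaler      # placeholder name
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: example-runners               # placeholder name
  minReplicas: 0
  maxReplicas: 10                       # placeholder value
---
# Workaround 2: keep idle Pods most of the time, but drop minReplicas
# to zero on a schedule so the Cluster Autoscaler can reclaim their Nodes.
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runners-autoscaler      # placeholder name
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: example-runners               # placeholder name
  minReplicas: 2                        # placeholder value
  maxReplicas: 10                       # placeholder value
  scheduledOverrides:
    # Placeholder window: 30 minutes, repeated every day
    - startTime: "2023-01-01T02:00:00Z"
      endTime: "2023-01-01T02:30:00Z"
      recurrenceRule:
        frequency: Daily
      minReplicas: 0
```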

Metadata

Assignees

No one assigned

    Labels

    community, enhancement, gha-runner-scale-set, needs triage
