This repository was archived by the owner on Sep 30, 2024. It is now read-only.
We mutate the parallelism of this one Job from buildkite-autoscaler on the fly in order to scale up/down the number of agents
Each agent exits when it finishes the Buildkite job assigned to it, and new agents are automatically created to meet the desired parallelism, unless that parallelism is decreased by the autoscaler
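As a rough illustration of what this mutation amounts to, the sketch below builds a JSON merge-patch body that changes `spec.parallelism` on the single long-lived Job. The function name and the min/max clamping are hypothetical and not taken from buildkite-autoscaler; only `spec.parallelism` is a real Job field. Sending the patch to the cluster is deliberately left out.

```python
import json


def parallelism_patch(desired_agents: int, min_agents: int, max_agents: int) -> str:
    """Return a JSON merge patch that sets spec.parallelism on a Job.

    Hypothetical helper: the clamping bounds are an assumption about how
    an autoscaler might limit the fleet size.
    """
    if min_agents > max_agents:
        raise ValueError("min_agents must not exceed max_agents")
    # Keep the requested agent count within [min_agents, max_agents].
    clamped = max(min_agents, min(desired_agents, max_agents))
    return json.dumps({"spec": {"parallelism": clamped}})


if __name__ == "__main__":
    print(parallelism_patch(desired_agents=12, min_agents=1, max_agents=10))
```

Because the same Job object is patched over and over, every scale-down relies on Kubernetes choosing which pods to remove, which is part of what the concerns below are about.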
There are a few hints that this is not the best-practice way of doing things:
We seem to be applying a stateful management approach (scaling a single Job entity up and down) to what should probably be a stateless queue processing mechanism
We cannot set an unlimited completions count, and it is explicitly stated that our current value is a workaround:
The completions attribute determines the number of times an agent will be spawned. This is set to 2147483647 and should not be edited.
We cannot set an unlimited backoffLimit, and it seems at least certain types of agent starts/restarts count towards the backoff limit, leading to incident 92 where we saw a lot of backoff exceeded issues which likely caused difficulties in scaling the stateless fleet up. The workaround is similar to the completions workaround: https://github.com/sourcegraph/infrastructure/pull/3176
Agent scaling behaviour appears potentially unpredictable, and prone to forceful cancellation by Kubernetes - if a job completes, and a new agent spins up to fulfill the desired parallelism, it might get immediately stopped by Kubernetes as part of autoscaler scaledown.
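For reference, 2147483647 is the largest signed 32-bit integer, which is presumably why it stands in for "unlimited" here: the Job spec's integer fields are 32-bit. The sketch below shows the rough shape of the relevant part of the Job spec under this workaround. The helper name is hypothetical, and the `backoffLimit` value from the linked PR is not reproduced because it is not stated here.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest signed 32-bit integer


def long_lived_job_spec(parallelism: int) -> dict:
    """Sketch of the spec fields of a single, effectively never-completing Job.

    Hypothetical helper for illustration only; the pod template and
    backoffLimit are omitted on purpose.
    """
    if not 0 <= parallelism <= INT32_MAX:
        raise ValueError("parallelism must fit in a non-negative int32")
    return {
        # Workaround: there is no "unlimited" value, so use the int32 maximum.
        "completions": INT32_MAX,
        # The field the autoscaler mutates to scale agents up and down.
        "parallelism": parallelism,
    }


if __name__ == "__main__":
    print(long_lived_job_spec(parallelism=5))
```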
Judging from the Kubernetes Job docs, I think a more best-practice approach might be to dynamically generate Jobs on the fly:
1. Query for pending jobs and running jobs
   a. If running > maxAgentsCount, do nothing
2. Create a new Job for $count = pending (subject to maxAgentsCount and minAgentsCount), with completions: $count and parallelism: $count
   a. Job template source TBD. Ideas: a "no-op" Job deployed to K8s, pulling the template from the repo, or embedding it within the autoscaler
   b. Creating Jobs from templates is an officially documented use case: https://kubernetes.io/docs/tasks/job/parallel-processing-expansion/
   c. This seems aligned with how completions should be used:
      "As pods successfully complete, the Job tracks the successful completions. When a specified number of successful completions is reached, the task (ie, Job) is complete."
3. Repeat
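The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and "subject to maxAgentsCount and minAgentsCount" is interpreted here as keeping the total number of agents (running plus newly created) between the two bounds, which the proposal does not spell out. The pod template is passed in as-is, since its source is still TBD.

```python
def plan_new_agents(pending: int, running: int, min_agents: int, max_agents: int) -> int:
    """Decide how many agents the next Job should start (hypothetical helper)."""
    if running > max_agents:
        return 0  # Step 1a: over the maximum already, do nothing.
    headroom = max_agents - running          # agents we may still add
    floor = max(min_agents - running, 0)     # agents needed to reach the minimum
    # Assumption: cover pending work, never drop below the minimum,
    # never exceed the maximum.
    return min(max(pending, floor), headroom)


def build_job_manifest(name: str, count: int, pod_template: dict) -> dict:
    """Build a Job manifest whose completions and parallelism both equal count."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": count,   # the Job is complete once this many pods succeed
            "parallelism": count,   # start all agents at once
            "template": pod_template,
        },
    }


if __name__ == "__main__":
    count = plan_new_agents(pending=20, running=3, min_agents=1, max_agents=10)
    if count > 0:
        # "agents-example" and the empty template are placeholders.
        print(build_job_manifest("agents-example", count, pod_template={}))
```

With this shape, each Job is finite: it completes once its agents have each processed one Buildkite job, so no sentinel value for completions is needed.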
This aligns better with the intended usage of Job IMO - ephemeral, do-once runners that each consume a task and exit. In this case, we create Jobs on an interval to process chunks of the Buildkite queue. We can implement buffers on top by deploying buffer Jobs, parallelism buffers, or similar.
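A minimal sketch of such an interval loop is shown below, with a simple buffer added on top of the pending count. All names and parameters (`query_counts`, `create_job`, `buffer_agents`, the 30-second default) are hypothetical, and the Buildkite and Kubernetes calls are injected as plain callables so the loop itself can be exercised without either. The sketch also ignores agents that have been created but are not yet running, which a real implementation would likely need to account for.

```python
import time
from typing import Callable, List, Tuple


def run_dispatch_loop(
    query_counts: Callable[[], Tuple[int, int]],  # returns (pending, running)
    create_job: Callable[[int], None],            # creates one Job sized for N agents
    max_agents: int,
    buffer_agents: int = 0,
    interval_seconds: float = 30.0,
    iterations: int = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Create one Job per interval; return the size chosen on each iteration."""
    sizes: List[int] = []
    for _ in range(iterations):
        pending, running = query_counts()
        # No headroom (running >= max_agents) means nothing is created.
        headroom = max(max_agents - running, 0)
        # Buffer: request a few extra agents beyond the pending jobs.
        count = min(pending + buffer_agents, headroom)
        if count > 0:
            create_job(count)
        sizes.append(count)
        sleep(interval_seconds)
    return sizes


if __name__ == "__main__":
    # Fake queue samples instead of real Buildkite queries.
    samples = iter([(4, 0), (0, 4), (9, 8)])
    created: List[int] = []
    run_dispatch_loop(
        lambda: next(samples), created.append,
        max_agents=10, buffer_agents=1, iterations=3, sleep=lambda _: None,
    )
    print(created)
```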
Our current autoscaler implementation for stateless agents (https://github.com/sourcegraph/sourcegraph/issues/30233 , https://github.com/sourcegraph/infrastructure/pull/3057) has the following core components:
My understanding is that we:
sourcegraph/sourcegraph builds. See thread, incident 92, and https://github.com/sourcegraph/sourcegraph/pull/32840
Work log: https://github.com/sourcegraph/devx-scratch/blob/main/2022/stateless-agents/log.snb.md