dev/ci: stateless autoscaler: investigate revamped approach with dynamic jobs

Our current autoscaler implementation for stateless agents (https://github.com/sourcegraph/sourcegraph/issues/30233, https://github.com/sourcegraph/infrastructure/pull/3057) has the following core components:

- https://github.com/sourcegraph/infrastructure/blob/7471e8a553c457ea9862f9e0ddbfc7315cd841bf/buildkite/kubernetes/buildkite-agent-stateless/README.md
- https://github.com/sourcegraph/infrastructure/blob/7471e8a553c457ea9862f9e0ddbfc7315cd841bf/docker-images/buildkite-autoscaler/buildkite-autoscaler.go

My understanding is that:

- We create a single Kubernetes [`Job`](https://kubernetes.io/docs/concepts/workloads/controllers/job)
- We mutate the `parallelism` of this one `Job` from `buildkite-autoscaler` on the fly to scale the number of agents up or down
- Each agent exits when it finishes the _Buildkite job_ assigned to it, and new agents are automatically created to meet the desired `parallelism`, unless that `parallelism` is decreased by the autoscaler

There are a few hints that this is not the best-practice way of doing things:

- We have been running into sporadic, severe issues with stateless agent availability since we [rolled out stateless builds to 75% of `sourcegraph/sourcegraph` builds](https://github.com/sourcegraph/sourcegraph/pull/32751). See [thread](https://sourcegraph.slack.com/archives/C02MWRMAFR8/p1647838869486089), [incident 92](https://app.incident.io/incidents/92), and https://github.com/sourcegraph/sourcegraph/pull/32840
- We seem to be applying a stateful management approach (scaling a single `Job` entity up and down) to what should probably be a stateless queue processing mechanism
- We cannot set an unlimited `completions` count, and it is explicitly stated that our current value is a workaround:
   > The completions attribute determines the number of times an agent will be spawned. This is set to 2147483647 and should not be edited.
- We cannot set an unlimited `backoffLimit`, and at least certain types of agent starts/restarts seem to count towards the backoff limit. This led to [incident 92](https://app.incident.io/incidents/92), where we saw many "backoff limit exceeded" failures that likely made it difficult to scale the stateless fleet up. The workaround is similar to the `completions` workaround: https://github.com/sourcegraph/infrastructure/pull/3176
- Agent scaling behaviour appears potentially unpredictable and prone to forceful cancellation by Kubernetes: if a Buildkite job completes and a new agent spins up to fulfill the desired `parallelism`, that agent might get immediately stopped by Kubernetes as part of an autoscaler scale-down.
- Discussion during [incident 92](https://app.incident.io/incidents/92) was mostly one of confusion about how Kubernetes is responding to agent shutdowns, and [confusion about how autoscaling is actually expected to behave](https://github.com/sourcegraph/infrastructure/pull/3177#issuecomment-1074122791)

Judging from [the Kubernetes `Job` docs](https://kubernetes.io/docs/concepts/workloads/controllers/job), I think an approach closer to best practice might be to dynamically generate `Job`s on the fly:

1. Query for pending jobs and running jobs
   a. If running > maxAgentsCount, do nothing
2. Create **a new `Job`** for `$count = pending` (subject to maxAgentsCount and minAgentsCount), with `completions: $count` and `parallelism: $count`
   a. Job template source TBD. Ideas are a "no-op" Job deployed to K8s, simply pulling the template from the repo, or just embedding it within the autoscaler
   b. Creating Jobs from templates is an officially documented use case: https://kubernetes.io/docs/tasks/job/parallel-processing-expansion/
   c. This seems aligned with how `completions` should be used:
   > As pods successfully complete, the Job tracks the successful completions. When a specified number of successful completions is reached, the task (ie, Job) is complete.
3. Repeat
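
The sizing decision in steps 1 and 2 could be sketched roughly as below. This is only an illustration: `queueState`, `jobSize`, and `planJob` are hypothetical names, and the sketch assumes that `maxAgentsCount` and `minAgentsCount` bound the total fleet (running plus newly requested agents), which the steps above leave open.

```go
package main

import "fmt"

// queueState is a hypothetical snapshot of the Buildkite queue (step 1).
type queueState struct {
	Pending int // Buildkite jobs waiting for an agent
	Running int // Buildkite jobs currently being processed by an agent
}

// jobSize holds the two Kubernetes Job fields that step 2 sets to the same value.
type jobSize struct {
	Completions int32
	Parallelism int32
}

// planJob returns the size of the next Kubernetes Job to create, or ok=false
// when no Job should be created on this pass.
//
// Assumption: minAgents and maxAgents bound running + newly requested agents.
func planJob(q queueState, minAgents, maxAgents int) (size jobSize, ok bool) {
	// Step 1a: the fleet is already over the cap, so do nothing.
	if q.Running > maxAgents {
		return jobSize{}, false
	}

	// Step 2: start from the pending count, then apply the bounds.
	count := q.Pending
	if q.Running+count > maxAgents {
		count = maxAgents - q.Running
	}
	if q.Running+count < minAgents {
		count = minAgents - q.Running
	}
	if count <= 0 {
		return jobSize{}, false
	}
	return jobSize{Completions: int32(count), Parallelism: int32(count)}, true
}

func main() {
	// 10 pending and 3 running with a cap of 8 leaves room for 5 new agents.
	size, ok := planJob(queueState{Pending: 10, Running: 3}, 0, 8)
	fmt.Println(size.Completions, size.Parallelism, ok) // prints: 5 5 true
}
```

Keeping this as a pure function means the bounds logic can be unit tested without touching Buildkite or Kubernetes; the actual `Job` creation would sit behind it.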

This aligns better with the intended usage of `Job`, IMO: ephemeral, do-once runners that consume tasks and exit. In this case, we create Jobs on an interval to process chunks of the Buildkite queue. We can implement buffers on top by deploying buffer Jobs, `parallelism` buffers, or similar.

---

Work log: https://github.com/sourcegraph/devx-scratch/blob/main/2022/stateless-agents/log.snb.md
