This repository was archived by the owner on Sep 30, 2024. It is now read-only.

dev/ci: stateless autoscaler: investigate revamped approach with dynamic jobs #32843

@bobheadxi

Description


Our current autoscaler implementation for stateless agents (https://github.com/sourcegraph/sourcegraph/issues/30233, https://github.com/sourcegraph/infrastructure/pull/3057) works roughly as follows, as far as I understand it:

  • We create a single Kubernetes Job
  • buildkite-autoscaler mutates the parallelism of this one Job on the fly to scale the number of agents up or down
  • Each agent exits when it finishes the Buildkite job assigned to it, and new agents are automatically created to meet the desired parallelism, unless the autoscaler has decreased that parallelism

There are a few hints that this is not the best-practice way of doing things:

  • We have been running into sporadic, severe issues with stateless agent availability since we rolled out stateless builds to 75% of sourcegraph/sourcegraph builds. See thread, incident 92, and https://github.com/sourcegraph/sourcegraph/pull/32840
  • We seem to be applying a stateful management approach (scaling a single Job entity up and down) to what should probably be a stateless queue processing mechanism
  • We cannot set an unlimited completions count, and it is explicitly stated that our current value is a workaround:

    The completions attribute determines the number of times an agent will be spawned. This is set to 2147483647 and should not be edited.

  • We cannot set an unlimited backoffLimit, and at least some types of agent starts/restarts seem to count towards the backoff limit. This led to incident 92, where we saw a lot of backoff-limit-exceeded failures that likely made it difficult to scale up the stateless fleet. The workaround is similar to the completions workaround: https://github.com/sourcegraph/infrastructure/pull/3176
  • Agent scaling behaviour appears potentially unpredictable, and prone to forceful cancellation by Kubernetes - if a Buildkite job completes and a new agent spins up to fulfill the desired parallelism, Kubernetes might immediately stop that agent as part of an autoscaler scale-down.
  • Discussion during incident 92 was dominated by confusion about how Kubernetes responds to agent shutdowns and about how autoscaling is actually expected to behave
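
For context on the completions workaround: as far as I know, `completions` (like `parallelism` and `backoffLimit`) is a signed 32-bit integer field in the Job spec, so 2147483647 is simply the largest value the field can hold rather than a real "unlimited" setting. A quick sketch of that arithmetic, with illustrative names that are not taken from our config:

```python
# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1

# The value quoted above from the current agent Job configuration.
CURRENT_COMPLETIONS = 2147483647


def remaining_completions(succeeded: int, completions: int = CURRENT_COMPLETIONS) -> int:
    """Agent runs left before the single long-lived Job counts as complete."""
    return max(completions - succeeded, 0)
```

In practice we would take a very long time to exhaust that budget, but it is still a finite ceiling on a Job that is meant to live forever.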

Judging from the Kubernetes Job docs, I think an approach closer to best practice might be to generate Jobs dynamically on the fly:

  1. Query for pending jobs and running jobs
    a. If running > maxAgentsCount, do nothing
  2. Create a new Job with $count = pending (subject to maxAgentsCount and minAgentsCount), setting completions: $count and parallelism: $count
    a. Job template source TBD. Ideas: a "no-op" Job deployed to K8s, pulling the template from the repo, or embedding it within the autoscaler
    b. Creating Jobs from templates is an officially documented use case: https://kubernetes.io/docs/tasks/job/parallel-processing-expansion/
    c. This seems aligned with how completions should be used:

    As pods successfully complete, the Job tracks the successful completions. When a specified number of successful completions is reached, the task (ie, Job) is complete.

  3. Repeat
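
The steps above could be sketched roughly as follows. This is a minimal illustration, not the autoscaler's actual code: the function names, the label, and the container name are all hypothetical, and I am assuming that "subject to maxAgentsCount and minAgentsCount" means keeping the total number of agents (running plus new) within those bounds.

```python
def agents_to_start(pending: int, running: int, min_agents: int, max_agents: int) -> int:
    """Number of agents the next Job should run (0 means: create nothing)."""
    # Step 1a: already above the cap, so do nothing this round.
    if running > max_agents:
        return 0
    # Assumption: the min/max bounds apply to the total agent count
    # (running + newly created), not to the size of each individual Job.
    target_total = min(max(running + pending, min_agents), max_agents)
    return max(target_total - running, 0)


def build_job_manifest(count: int, name: str, image: str) -> dict:
    """Build a batch/v1 Job manifest that runs exactly `count` agents once."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Hypothetical label, only here to make the Jobs easy to select.
            "labels": {"app": "buildkite-agent-stateless"},
        },
        "spec": {
            # Step 2: the Job is complete once `count` agents have exited
            # successfully, and all of them may run at the same time.
            "completions": count,
            "parallelism": count,
            "template": {
                "spec": {
                    # Job pods must use restartPolicy Never or OnFailure.
                    "restartPolicy": "Never",
                    "containers": [{"name": "agent", "image": image}],
                },
            },
        },
    }
```

For example, with 5 pending jobs, 8 running agents, and a maximum of 10, `agents_to_start(5, 8, 0, 10)` yields 2, and the resulting manifest can be serialised with `json.dumps` and submitted through whichever client the autoscaler ends up using.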

This aligns better with the intended usage of Job IMO - ephemeral, do-once runners that consume tasks and exit. In this case, we create Jobs on an interval to process chunks of the Buildkite queue. We can implement buffers on top by deploying buffer Jobs, parallelism buffers, or similar.
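
A rough sketch of what creating Jobs on an interval with a simple buffer might look like. Everything here is hypothetical: `query_queue` and `create_job` stand in for the Buildkite query and the Kubernetes Job creation, and the buffer is modelled as a fixed number of extra agents added to each chunk, which is only one of the options mentioned above.

```python
import time
from typing import Callable, Tuple


def run_autoscaler(
    query_queue: Callable[[], Tuple[int, int]],
    create_job: Callable[[int], None],
    max_agents: int,
    buffer: int = 0,
    interval_seconds: float = 30.0,
    iterations: int = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Each interval, create one Job sized to the pending queue plus a buffer."""
    for i in range(iterations):
        # query_queue returns (pending jobs, running agents).
        pending, running = query_queue()
        # Never start more agents than the remaining headroom under the cap.
        headroom = max(max_agents - running, 0)
        # Assumption: idle buffer agents are reported as running; otherwise
        # extra buffer agents would pile up every interval until the cap.
        count = min(pending + buffer, headroom)
        if count > 0:
            create_job(count)
        # Wait before the next round, but not after the last one.
        if i < iterations - 1:
            sleep(interval_seconds)
```

Injecting `query_queue`, `create_job`, and `sleep` keeps the loop free of any real API calls, so the sizing logic can be exercised with fakes before wiring it up to Buildkite and Kubernetes.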


Work log: https://github.com/sourcegraph/devx-scratch/blob/main/2022/stateless-agents/log.snb.md
