GitLab Runner Job Router (#19607)

## Problem Statement

GitLab's CI job scheduling is currently pull-based: GitLab Rails holds the job queue and runners repeatedly poll over HTTP to request available jobs. This model:

- limits the ability to intelligently route jobs based on cost, performance, and capacity
- provides no mechanism for enforcing custom policies before job assignment
- gives runners no visibility into queue depth for informed autoscaling decisions
- offers no way to prioritize jobs (e.g. ensuring production deployments run before lower-priority work)

## Proposal

Introduce the **Job Router**, a new component within GitLab Relay (previously KAS), that transforms CI job scheduling from pull-based polling into push-based routing. With GitLab Relay as the central communication layer between GitLab Rails and runners, the Job Router enables enforcing policies through programmable Runner Controllers, intelligently routing jobs to the right runners, prioritizing jobs based on business criticality, and providing queue signals to runners for smarter autoscaling.

This evolution occurs across three phases:

1. **Smart Proxy with Admission Control**: replace runner HTTP polling with bidirectional gRPC streams through the Job Router. Introduce Runner Controllers as a programmable security layer for admit/deny decisions on jobs before runner assignment. Implement spillover-based runner prioritization within capability groups.
2. **Distributed Autoscaling Coordination**: enable queue depth autoscaling coordination across multiple GitLab Relay instances using capability fingerprinting and Redis-based distributed state, without taking over job orchestration yet.
3. **Full Job Orchestration**: the Job Router owns the complete job lifecycle. GitLab pushes all actionable jobs to the Job Router, which handles routing, admission, assignment, and state management with new job states (`routing → admitted → assigned`).
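Phases 1 and 2 both depend on grouping compatible runners and on spillover within each group. The following is a minimal Python sketch of those two ideas, not the Job Router's actual implementation. It assumes the fingerprint is computed over the tags, runner type, protected status, and project access named in this epic's design concepts. The `Runner` fields, the "lower number means higher priority" convention, and the function names are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class Runner:
    name: str
    tags: frozenset
    runner_type: str        # e.g. "instance", "group", "project" (assumed values)
    protected: bool
    project_ids: frozenset  # assumed representation of "project access"
    priority: int           # assumed convention: lower value = higher priority
    capacity: int           # maximum number of concurrent jobs
    running: int = 0        # jobs currently assigned


def capability_fingerprint(runner: Runner) -> str:
    """Return a stable hash of the attributes that make runners interchangeable."""
    payload = json.dumps(
        {
            "tags": sorted(runner.tags),  # sorted so ordering never changes the hash
            "type": runner.runner_type,
            "protected": runner.protected,
            "projects": sorted(runner.project_ids),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def pick_runner(group):
    """Spillover: return the highest-priority runner that still has free capacity."""
    for runner in sorted(group, key=lambda r: r.priority):
        if runner.running < runner.capacity:
            runner.running += 1
            return runner
    return None  # every runner in the capability group is at capacity


reserved = Runner("reserved", frozenset({"linux", "docker"}), "group",
                  False, frozenset({1, 2}), priority=0, capacity=2)
on_demand = Runner("on-demand", frozenset({"docker", "linux"}), "group",
                   False, frozenset({2, 1}), priority=1, capacity=1)

# Same capabilities, so both runners land in the same capability group.
same_group = capability_fingerprint(reserved) == capability_fingerprint(on_demand)

# Three jobs arrive: the first two fill "reserved", the third spills over.
assignments = [pick_runner([on_demand, reserved]).name for _ in range(3)]
overflow = pick_runner([on_demand, reserved])  # group is now full
```

In this sketch the lower-priority `on-demand` runner only receives a job once `reserved` has reached its capacity of two, and a fourth job finds no runner, which is the point at which a queue-depth signal would matter for autoscaling.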
Key design concepts:

- **Capability Fingerprinting**: a stable hash of tags, runner type, protected status, and project access that groups compatible runners for efficient routing and autoscaling
- **Spillover Algorithm**: higher priority runners get first choice within capability groups; lower priority runners only receive jobs when higher priority ones reach capacity
- **Job Prioritization**: with centralized routing, the Job Router can order jobs by business criticality, ensuring high-priority work like production deployments is assigned before lower-priority jobs
- **Admission Control**: user-implemented controllers using a standard gRPC interface enforce custom policies at instance, group, and project levels

## References

- [Runner Job Router design document](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/runner_job_router/)
- [Runner Admission Controller blueprint](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/runner_admission_controller/)
- [Reverse gRPC Tunnel for Workspaces and CI](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/reverse-grpc-tunnel-workspaces-and-ci/)
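As an illustration of how job prioritization and admission control from the design concepts might combine, here is a small Python sketch. It is an assumption-laden stand-in: real Runner Controllers are described as speaking a gRPC interface, whereas this sketch models each controller as a plain callable returning `True` (admit) or `False` (deny). The `JobQueue` class, the numeric priority scale, and the example policy are hypothetical, and the handling of denied jobs is left unspecified because the epic does not describe it.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedJob:
    priority: int                        # assumed: lower value = more critical
    seq: int                             # tie-breaker, keeps FIFO order within a priority
    name: str = field(compare=False)
    state: str = field(default="routing", compare=False)


class JobQueue:
    def __init__(self, controllers):
        self._heap = []
        self._seq = itertools.count()
        self._controllers = controllers  # callables standing in for gRPC controllers

    def push(self, name, priority):
        heapq.heappush(self._heap, QueuedJob(priority, next(self._seq), name))

    def next_admitted(self):
        """Pop jobs in priority order and return the first one every controller admits."""
        while self._heap:
            job = heapq.heappop(self._heap)
            if all(admit(job) for admit in self._controllers):
                job.state = "admitted"
                return job
            # Denied jobs are simply skipped here; the real behavior is not specified.
        return None


def deny_untrusted(job):
    """Hypothetical policy: reject jobs whose name marks them as untrusted."""
    return not job.name.startswith("untrusted-")


queue = JobQueue([deny_untrusted])
queue.push("unit-tests", priority=5)
queue.push("untrusted-script", priority=0)
queue.push("deploy-production", priority=1)

first = queue.next_admitted()
second = queue.next_admitted()
third = queue.next_admitted()
```

Here `untrusted-script` has the most urgent priority but is denied, so `deploy-production` is admitted first, followed by `unit-tests`; the third call finds the queue empty.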
