feat: preserve partial scheduling progress on context timeout instead of rolling back all work #4559

Merged
dejanzele merged 6 commits into armadaproject:master from dejanzele:feat/scheduler-graceful-shutdown
Feb 6, 2026
Conversation

@dejanzele
Member

@dejanzele dejanzele commented Dec 2, 2025

What type of PR is this?

Enhancement

What this PR does / why we need it

Previously, when the scheduler hit its timeout during the scheduling cycle, it would return an error and discard all work, even jobs that were successfully scheduled before the timeout.

This change implements a two-tier timeout system for the scheduler to handle long scheduling cycles gracefully.

This change introduces a new config field in the scheduler, `newJobsSchedulingTimeout`:

```
# scheduler.config.yaml
scheduling:
  # Hard timeout - absolute maximum duration for a scheduling cycle.
  # When exceeded, the cycle aborts with an error and discards all work.
  maxSchedulingDuration: 10s

  # Soft timeout - stop scheduling new jobs after this duration.
  # Evicted jobs continue to be rescheduled until the hard timeout.
  # Set to 0 to disable soft timeout behavior.
  # Must be less than maxSchedulingDuration when non-zero.
  newJobsSchedulingTimeout: 8s
```

Expected output

When the soft timeout fires:

```
INFO Soft timeout reached for pool default, switching to evicted-only mode
INFO Looping through candidate gangs for pool default...
INFO Scheduled 873 jobs for pool default
```

When the hard timeout fires (unchanged behavior):

```
ERROR hard timeout: context deadline exceeded
```

How to test

Check the section at the end called Additional Files for the test script and test Armada job.

1. Configure the scheduler with a short timeout in `_local/scheduler/config.yaml`:
   ```
   maxSchedulingDuration: 200ms
   ```
2. Configure the fake executor with enough capacity in `_local/fakeexecutor/config.yaml`:
   ```
   nodes:
     - name: "fake-node"
       count: 50
       allocatable:
         cpu: "64"
         memory: "256Gi"
   ```
3. Start the local environment: `goreman -f _local/procfiles/fake-executor.Procfile start`
4. Create two test queues:
   ```
   armadactl create queue queue-a
   armadactl create queue queue-b
   ```
5. Run the following commands to generate jobs:
   ```
   ./scripts/submit-jobs.sh -c 5000 -q queue-a -j jobset-timeout-a example/fair-share-test.yaml
   ./scripts/submit-jobs.sh -c 5000 -q queue-b -j jobset-timeout-b example/fair-share-test.yaml
   ```
6. Assert that the following logs appear in the scheduler output:
   ```
   INFO Timeout reached for pool default, switching to evicted-only mode
   INFO Scheduling cycle interrupted by context deadline exceeded: scheduled 873 jobs for pool default
   INFO Scheduled on executor pool default in 19.983083ms with error <nil>
   ```
Additional Files

```
# scripts/submit-jobs.sh

#!/bin/bash
set -e

COUNT=1
JOBSET="test-jobset"
QUEUE="test-queue"
JOB_TEMPLATE=""
MAX_PARALLEL=50

while [[ $# -gt 0 ]]; do
    case $1 in
        -c|--count) COUNT="$2"; shift 2 ;;
        -j|--jobset) JOBSET="$2"; shift 2 ;;
        -q|--queue) QUEUE="$2"; shift 2 ;;
        -p|--parallel) MAX_PARALLEL="$2"; shift 2 ;;
        -*) echo "Unknown option $1"; exit 1 ;;
        *) JOB_TEMPLATE="$1"; shift ;;
    esac
done

[[ -z "$JOB_TEMPLATE" ]] && JOB_TEMPLATE="example/fair-share-test.yaml"
[[ ! -f "$JOB_TEMPLATE" ]] && echo "Error: $JOB_TEMPLATE not found" && exit 1

ARMADACTL="./armadactl"
[[ ! -f "$ARMADACTL" ]] && ARMADACTL="armadactl"

TEMP_DIR=$(mktemp -d)
# Single quotes defer expansion to trap time and survive paths with spaces.
trap 'rm -rf "$TEMP_DIR"' EXIT

JOB_FILE="$TEMP_DIR/job.yaml"
sed -e "s/^jobSetId:.*/jobSetId: $JOBSET/" -e "s/^queue:.*/queue: $QUEUE/" "$JOB_TEMPLATE" > "$JOB_FILE"

$ARMADACTL create queue "$QUEUE" 2>/dev/null || true

echo "Submitting $COUNT batches to queue '$QUEUE' jobset '$JOBSET'..."

PIDS=()
for ((i=1; i<=COUNT; i++)); do
    $ARMADACTL submit "$JOB_FILE" >/dev/null 2>&1 &
    PIDS+=($!)
    if ((${#PIDS[@]} >= MAX_PARALLEL)) || ((i == COUNT)); then
        for pid in "${PIDS[@]}"; do wait "$pid"; done
        PIDS=()
        echo "Progress: $i/$COUNT"
    fi
done

echo "Done. Submitted $COUNT batches to queue '$QUEUE'"
```
```
# example/fair-share-test.yaml

queue: test-queue
jobSetId: fair-share-test
jobs:
  - namespace: default
    priority: 1000
    podSpec: &podspec
      terminationGracePeriodSeconds: 0
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:latest
          command: ["sleep", "3600"]
          resources:
            limits:
              memory: 64Mi
              cpu: 50m
            requests:
              memory: 64Mi
              cpu: 50m
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
```

nikola-jokic
nikola-jokic previously approved these changes Dec 2, 2025
d80tb7
d80tb7 previously requested changes Dec 3, 2025
Collaborator

@d80tb7 d80tb7 left a comment


I'm not convinced this works in the case of preemption.

@dejanzele dejanzele force-pushed the feat/scheduler-graceful-shutdown branch 9 times, most recently from 27cd367 to fa91955 Compare December 18, 2025 00:04
@dejanzele dejanzele force-pushed the feat/scheduler-graceful-shutdown branch 8 times, most recently from 9eaf27f to 800f2c1 Compare February 5, 2026 12:46
… of rolling back all work

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the feat/scheduler-graceful-shutdown branch from 800f2c1 to 196b8d3 Compare February 5, 2026 12:47
JamesMurkin
JamesMurkin previously approved these changes Feb 5, 2026
@JamesMurkin JamesMurkin dismissed d80tb7’s stale review February 6, 2026 12:29

We've changed approach and I'm happy this one should work

@dejanzele dejanzele enabled auto-merge (squash) February 6, 2026 12:29
@dejanzele dejanzele merged commit e694299 into armadaproject:master Feb 6, 2026
15 checks passed
dslear pushed a commit to dslear/armada that referenced this pull request Feb 9, 2026
… of rolling back all work (armadaproject#4559)