Allow pool scheduling to fail independently for some errors by JamesMurkin · Pull Request #4655 · armadaproject/armada

JamesMurkin · 2026-02-02T15:34:44Z

Pool scheduling can now fail independently for errors that are deemed recoverable.

Currently, "recoverable" errors cover:

- Errors from the internal scheduling algo
- Errors from the reconciler
- Timeouts

Largely, we're trying to cover issues caused by bugs rather than fundamental "critical" failures, such as our job db failing to upsert our changes (which implies something is really wrong with our job db).

Unrecoverable errors will still cause all pools to fail. They typically cover completely unexpected events where we should just abort.

Over time we will likely change what is deemed recoverable. From experience, though, what is covered now will handle nearly all loop failures, as they're typically caused by bugs where bad state is passed to the scheduling code, which then errors because it gets into an unknown state.

This feature is disabled by default and is configured with `disableIndependentPoolFailures`.

Currently SchedulerResult:

- Is used as the external interface to scheduling_algo and the internal types, which means the internal types are using more complex objects than needed
- Pool information is scattered around the result
- Some information is duplicated (and could be inconsistent) such as ScheduledJobs/PreemptedJobs

This PR refactors it so:

- SchedulerResult is now better tailored only as an external result from scheduling_algo
- The result is purely by pool
  - To avoid duplication
  - Simplify the representation (remove maps of result by pool)
- Has helper functions to act similarly to before if you just want all scheduled jobs etc.

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
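As a rough illustration of a result that is "purely by pool" with helpers that flatten it, consider the sketch below. The types `poolResult` and `schedulerResult`, their fields, and the helper names are assumptions made for this example and are not the definitions in `internal/scheduler/scheduling/result.go`. Job IDs are plain strings here for brevity.

```go
package main

import "fmt"

// poolResult is a hypothetical per-pool outcome; the field names are
// illustrative and are not Armada's actual definitions.
type poolResult struct {
	Pool          string
	ScheduledJobs []string
	PreemptedJobs []string
}

// schedulerResult holds one entry per pool and nothing else, so each job
// is recorded in exactly one place.
type schedulerResult struct {
	PoolResults []poolResult
}

// AllScheduledJobs flattens the per-pool results for callers that do not
// care which pool a job was scheduled in.
func (r schedulerResult) AllScheduledJobs() []string {
	var jobs []string
	for _, p := range r.PoolResults {
		jobs = append(jobs, p.ScheduledJobs...)
	}
	return jobs
}

// AllPreemptedJobs does the same for preempted jobs.
func (r schedulerResult) AllPreemptedJobs() []string {
	var jobs []string
	for _, p := range r.PoolResults {
		jobs = append(jobs, p.PreemptedJobs...)
	}
	return jobs
}

func main() {
	res := schedulerResult{PoolResults: []poolResult{
		{Pool: "cpu", ScheduledJobs: []string{"job-a", "job-b"}},
		{Pool: "gpu", ScheduledJobs: []string{"job-c"}, PreemptedJobs: []string{"job-d"}},
	}}
	fmt.Println(res.AllScheduledJobs(), res.AllPreemptedJobs())
}
```

Because the flattened views are computed from the per-pool entries rather than stored alongside them, there is no second copy that could drift out of sync.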

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

# Conflicts:
#   internal/scheduler/metrics/cycle_metrics.go
#   internal/scheduler/metrics/cycle_metrics_test.go
#   internal/scheduler/scheduler.go
#   internal/scheduler/scheduling/result.go
#   internal/scheduler/scheduling/scheduling_algo.go

mauriceyap · 2026-02-06T11:51:41Z

@Mergifyio refresh

mergify · 2026-02-06T13:37:55Z

refresh

✅ Pull request refreshed

JamesMurkin added 7 commits February 1, 2026 23:16

Lint

3486833

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

Merge branch 'master' into scheduler_result_refactor

c86ca71

WIP Improve scheduler partial result

91220c2

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

Improvements

de2e465

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

Add metrics + fix reporting scheduler result metrics

f99c90e

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

Improve logging

8b2507d

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

JamesMurkin marked this pull request as ready for review February 4, 2026 17:19

JamesMurkin added 6 commits February 4, 2026 17:23

Remove unused pools from cycle_metrics

1afd34b

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

Tests

49598c7

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

Lint

5f32661

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

tests

92da492

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

Merge branch 'master' into improve_scheduler_partial_result

cf8d3a9

JamesMurkin changed the title ~~Improve scheduler partial result~~ Allow pool scheduling to fail independently for some errors Feb 5, 2026

MustafaI approved these changes Feb 6, 2026

Merge branch 'master' into improve_scheduler_partial_result

1a25551

JamesMurkin enabled auto-merge (squash) February 6, 2026 11:27

mauriceyap approved these changes Feb 6, 2026

JamesMurkin merged commit c4ea910 into master Feb 6, 2026
13 of 14 checks passed

JamesMurkin deleted the improve_scheduler_partial_result branch February 6, 2026 13:23

JamesMurkin merged 14 commits into master from improve_scheduler_partial_result