Allow pool scheduling to fail independently for some errors #4655

Merged
JamesMurkin merged 14 commits into master from improve_scheduler_partial_result on Feb 6, 2026
@JamesMurkin
Contributor

@JamesMurkin JamesMurkin commented Feb 2, 2026

Pool scheduling can now fail independently for errors that are deemed recoverable.

Currently, "recoverable" errors cover:

  • Errors from the internal scheduling algo
  • Errors from the reconciler
  • Timeouts

Largely, we're trying to cover issues caused by ordinary bugs rather than fundamental "critical" failures, such as our job db failing to upsert our changes (which implies something is really wrong with the job db).

Unrecoverable errors still cause all pools to fail. These typically cover completely unexpected events where we should just abort.
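The split between recoverable and unrecoverable errors could be expressed roughly as below. This is only a sketch: `recoverableError`, `markRecoverable`, `isRecoverable` and both example errors are hypothetical names invented for illustration, not the types used in this PR. It assumes recoverable errors are tagged with a wrapper type and that timeouts surface as `context.DeadlineExceeded`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// recoverableError is a hypothetical wrapper marking an error as safe to
// contain within a single pool. It is not the type used in the PR.
type recoverableError struct {
	cause error
}

func (e *recoverableError) Error() string { return "recoverable: " + e.cause.Error() }
func (e *recoverableError) Unwrap() error { return e.cause }

// markRecoverable wraps err so that isRecoverable can detect it later.
func markRecoverable(err error) error {
	if err == nil {
		return nil
	}
	return &recoverableError{cause: err}
}

// isRecoverable reports whether err should fail only the pool that produced
// it. Timeouts are assumed to appear as context.DeadlineExceeded.
func isRecoverable(err error) bool {
	if err == nil {
		return false
	}
	var r *recoverableError
	if errors.As(err, &r) {
		return true
	}
	return errors.Is(err, context.DeadlineExceeded)
}

// Illustrative errors only.
var (
	errAlgoBadState = errors.New("scheduling algo reached an unknown state")
	errJobDbUpsert  = errors.New("job db upsert failed")
)

func main() {
	timeout := fmt.Errorf("pool timed out: %w", context.DeadlineExceeded)
	fmt.Println(isRecoverable(markRecoverable(errAlgoBadState))) // true
	fmt.Println(isRecoverable(timeout))                          // true
	fmt.Println(isRecoverable(errJobDbUpsert))                   // false
}
```

Tagging at the point where the error is produced keeps the decision close to the code that knows whether the failure is contained; anything left untagged falls through to the abort path.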

Over time we will likely change what is deemed recoverable. From experience, though, what is covered now should handle nearly all loop failures, since they're typically caused by bugs where bad state is passed to the scheduling code, which then errors because it gets into an unknown state.

This feature is disabled by default and is configured with `disableIndependentPoolFailures`.
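As a rough illustration of how such a flag might gate the behaviour, the sketch below runs each pool in turn and either records a recoverable failure against that pool or aborts the whole cycle. Everything here is hypothetical: the `config` struct, `schedulePools`, `failOn` and the sentinel errors are invented for the example. It assumes that setting `DisableIndependentPoolFailures` to true turns isolation off, and takes no position on the real default or on where the option lives in the configuration.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for the two error classes.
var (
	errRecoverable = errors.New("recoverable scheduling error")
	errFatal       = errors.New("unrecoverable error")
)

// config mirrors only the option named in the description (assumed shape).
type config struct {
	DisableIndependentPoolFailures bool
}

// schedulePools runs schedule for each pool in order. It returns the pools
// that succeeded, the pools that failed in isolation, and a non-nil error
// if the whole cycle must abort.
func schedulePools(cfg config, pools []string, schedule func(pool string) error) ([]string, map[string]error, error) {
	var succeeded []string
	failed := map[string]error{}
	for _, pool := range pools {
		poolErr := schedule(pool)
		if poolErr == nil {
			succeeded = append(succeeded, pool)
			continue
		}
		// Isolate the failure only when the feature is on and the error is recoverable.
		if !cfg.DisableIndependentPoolFailures && errors.Is(poolErr, errRecoverable) {
			failed[pool] = poolErr
			continue
		}
		return nil, nil, fmt.Errorf("scheduling pool %s: %w", pool, poolErr)
	}
	return succeeded, failed, nil
}

// failOn returns a schedule function that fails for exactly one pool.
func failOn(badPool string, err error) func(string) error {
	return func(pool string) error {
		if pool == badPool {
			return err
		}
		return nil
	}
}

func main() {
	pools := []string{"cpu", "gpu"}

	ok, failed, err := schedulePools(config{}, pools, failOn("gpu", errRecoverable))
	fmt.Println(ok, len(failed), err) // [cpu] 1 <nil>

	_, _, err = schedulePools(config{}, pools, failOn("gpu", errFatal))
	fmt.Println(err != nil) // true
}
```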

Currently SchedulerResult:
 - Is used both as the external interface to scheduling_algo and as an internal type, which means the internal code uses more complex objects than needed
 - Has pool information scattered around the result
 - Duplicates some information (which could become inconsistent), such as ScheduledJobs/PreemptedJobs

This PR refactors it so:
 - SchedulerResult is now tailored purely as the external result of scheduling_algo
 - The result is organised purely by pool
   - To avoid duplication
   - To simplify the representation (removes the maps of result by pool)
 - It has helper functions that behave similarly to before if you just want all scheduled jobs, etc.
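A per-pool result with flattening helpers might look something like the following. This is a simplified sketch, not the actual definition: the fields of `PoolResult` and the helper names `AllScheduledJobs` and `AllPreemptedJobs` are assumptions made for illustration, and jobs are represented as plain strings.

```go
package main

import "fmt"

// PoolResult is a simplified stand-in holding everything known about one pool.
type PoolResult struct {
	Pool          string
	ScheduledJobs []string
	PreemptedJobs []string
}

// SchedulerResult stores results purely by pool, so each job appears once.
type SchedulerResult struct {
	PoolResults []PoolResult
}

// AllScheduledJobs flattens scheduled jobs across pools (hypothetical helper).
func (r SchedulerResult) AllScheduledJobs() []string {
	var jobs []string
	for _, p := range r.PoolResults {
		jobs = append(jobs, p.ScheduledJobs...)
	}
	return jobs
}

// AllPreemptedJobs flattens preempted jobs across pools (hypothetical helper).
func (r SchedulerResult) AllPreemptedJobs() []string {
	var jobs []string
	for _, p := range r.PoolResults {
		jobs = append(jobs, p.PreemptedJobs...)
	}
	return jobs
}

func main() {
	result := SchedulerResult{PoolResults: []PoolResult{
		{Pool: "cpu", ScheduledJobs: []string{"job-a", "job-b"}},
		{Pool: "gpu", ScheduledJobs: []string{"job-c"}, PreemptedJobs: []string{"job-d"}},
	}}
	fmt.Println(result.AllScheduledJobs()) // [job-a job-b job-c]
	fmt.Println(result.AllPreemptedJobs()) // [job-d]
}
```

Deriving the combined lists on demand means there is no second copy of the data to drift out of sync with the per-pool entries.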

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
@JamesMurkin JamesMurkin marked this pull request as ready for review February 4, 2026 17:19

# Conflicts:
#	internal/scheduler/metrics/cycle_metrics.go
#	internal/scheduler/metrics/cycle_metrics_test.go
#	internal/scheduler/scheduler.go
#	internal/scheduler/scheduling/result.go
#	internal/scheduler/scheduling/scheduling_algo.go
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
@JamesMurkin JamesMurkin changed the title from "Improve scheduler partial result" to "Allow pool scheduling to fail independently for some errors" Feb 5, 2026
@JamesMurkin JamesMurkin enabled auto-merge (squash) February 6, 2026 11:27
@mauriceyap
Collaborator

mauriceyap commented Feb 6, 2026

@Mergifyio refresh

@JamesMurkin JamesMurkin merged commit c4ea910 into master Feb 6, 2026
13 of 14 checks passed
@JamesMurkin JamesMurkin deleted the improve_scheduler_partial_result branch February 6, 2026 13:23
@mergify
mergify bot commented Feb 6, 2026

refresh

✅ Pull request refreshed

dslear pushed a commit to dslear/armada that referenced this pull request Feb 9, 2026
…oject#4655)
