Allow pool scheduling to fail independently for some errors #4655
Merged
JamesMurkin merged 14 commits into master on Feb 6, 2026
Currently SchedulerResult:
- Is used as the external interface to scheduling_algo and the internal types, which means the internal types are using more complex objects than needed
- Pool information is scattered around the result
- Some information is duplicated (and could be inconsistent) such as ScheduledJobs/PreemptedJobs

This PR refactors it so:
- SchedulerResult is now better tailored only as an external result from scheduling_algo
- The result is purely by pool
- To avoid duplication
- Simplify the representation (remove maps of result by pool)
- Has helper functions to act similar as before if you just want all scheduled jobs etc

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

# Conflicts:
#   internal/scheduler/metrics/cycle_metrics.go
#   internal/scheduler/metrics/cycle_metrics_test.go
#   internal/scheduler/scheduler.go
#   internal/scheduler/scheduling/result.go
#   internal/scheduler/scheduling/scheduling_algo.go
MustafaI approved these changes Feb 6, 2026

mauriceyap approved these changes Feb 6, 2026
dslear pushed a commit to dslear/armada that referenced this pull request Feb 9, 2026
…oject#4655)

Pool scheduling can now fail independently for errors that are deemed recoverable.

Currently "recoverable" errors cover:
- Errors from the internal scheduling algo
- Errors from the reconciler
- Timeouts

Largely we're trying to cover issues that are caused by bugs, rather than fundamental "critical" failures such as our job db failing to upsert our changes (as this implies something is really wrong with our job db).

Unrecoverable errors will still cause all pools to fail. These typically cover completely unexpected events where we should just abort.

Over time we will likely change what is deemed recoverable. From experience, what is covered now will handle nearly all loop failures, as they're typically caused by bugs where bad state is passed to the scheduling code, which then errors because it gets into an unknown state.

This feature is disabled by default and is configured with `disableIndependentPoolFailures`.

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>