Add Scheduler Backend framework#293

Merged
unmarshall merged 32 commits into ai-dynamo:main from kangclzjc:scheduler_backend
Apr 3, 2026

@kangclzjc kangclzjc commented Dec 22, 2025

Contributor

What type of PR is this?

To support different schedulers as backends, this PR modifies Grove and introduces a scheduler backend interface.

What this PR does / why we need it:

In the current PodGang component's sync flow we do the following:

  • Get the list of PodGangs that are expected to be created for the PCS.
  • Check which ones are still pending creation. For each pending PodGang we do the following:
    • Check if all pods have been created for the PodGang.
    • Check if all pods have the required PodGang label, which adds a back reference to the PodGang.
  • If all of the above checks are satisfied, go ahead and create the PodGang resource.
    As a result, the PodGang is created only after its pods. This ordering is a problem for the upcoming Workload API support and the kube-scheduler backend.
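
The current gating check can be sketched roughly as below. This is a minimal illustration, not the code in Grove: the `Pod` type is a simplified stand-in for `corev1.Pod`, and the label key `example.grove/podgang` and the function `readyToCreatePodGang` are hypothetical names chosen for this sketch.

```go
package main

import "fmt"

// Pod is a simplified stand-in for corev1.Pod (assumption for this sketch).
type Pod struct {
	Name   string
	Labels map[string]string
}

// podGangLabelKey is a hypothetical label key; the real key used by Grove
// is not shown in this PR description.
const podGangLabelKey = "example.grove/podgang"

// readyToCreatePodGang reports whether a pending PodGang may be created under
// the current flow: every expected pod must exist and must carry the label
// that refers back to the PodGang.
func readyToCreatePodGang(podGangName string, expectedPods []string, existing []Pod) bool {
	byName := make(map[string]Pod, len(existing))
	for _, p := range existing {
		byName[p.Name] = p
	}
	for _, name := range expectedPods {
		p, ok := byName[name]
		if !ok {
			return false // the pod has not been created yet
		}
		if p.Labels[podGangLabelKey] != podGangName {
			return false // the back-reference label is missing or points elsewhere
		}
	}
	return true
}

func main() {
	expected := []string{"pod-0", "pod-1"}
	pods := []Pod{{Name: "pod-0", Labels: map[string]string{podGangLabelKey: "pg-a"}}}
	// Only one of the two expected pods exists, so creation is deferred.
	fmt.Println(readyToCreatePodGang("pg-a", expected, pods))

	pods = append(pods, Pod{Name: "pod-1", Labels: map[string]string{podGangLabelKey: "pg-a"}})
	// Both pods exist and are labeled, so the PodGang may now be created.
	fmt.Println(readyToCreatePodGang("pg-a", expected, pods))
}
```

The sketch makes the ordering constraint visible: nothing about the PodGang exists until every pod does, which is what conflicts with a Workload object that has to exist before scheduling starts.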

We don't want to break the current PodGang workflow. This PR introduces the scheduler backend framework so that Workload management is left to a scheduler backend in Grove. For other schedulers, the scheduler backend in Grove may manage a different CR based on the PodGang (for example, for KAI it will create PodGroups; in the future, we will move this management from the KAI scheduler to the Grove scheduler backend).

To create a Workload object, a PodGang resource must exist first. In the current flow, the PodGang resource cannot be created until the Pods have been created and carry a back reference to the PodGang. The issue is that kube-scheduler runs its scoring/filtering plugins to reserve node capacity for the workload's PodGroups only after the Workload object has been created, and the Pods need to have a reference to the Workload object in their spec.

To accommodate the Workload API, the flow in the PodGang component needs to change as follows:

  • Create the PodGang with PodGroups (having empty PodReferences, as none will exist at this point) and the Initialized condition set to False.
  • Creation of the PodGang will trigger the creation of the Workload object in the schedulerbackend reconciler, which will use the kube-scheduler backend.

    This is out of scope for this PR and should be included in the next PR, which specifically handles the Workload API and kube-scheduler.

  • Once all Pod references are updated, set the Initialized condition to True.
  • Pods should not lift their scheduling gate until the PodGang has its Initialized condition set to True (done in the PCLQ reconciler).
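
The revised ordering can be sketched as a small state model. This is an illustration under stated assumptions: `PodGang` and `PodGroup` here are simplified stand-ins for the Grove scheduler API types, the `ExpectedPods` field and the boolean `Initialized` field (standing in for the Initialized condition) are hypothetical, and none of the function names come from the PR.

```go
package main

import "fmt"

// PodGroup and PodGang are simplified stand-ins for the Grove scheduler API
// types (assumption for this sketch).
type PodGroup struct {
	Name          string
	ExpectedPods  int      // hypothetical: number of pods the group should reference
	PodReferences []string // names of the pods that belong to the group
}

type PodGang struct {
	Name        string
	PodGroups   []PodGroup
	Initialized bool // stands in for the Initialized condition
}

// newPodGang creates the PodGang up front, before any pod exists: every
// PodGroup starts with empty PodReferences and Initialized is false.
func newPodGang(name string, groups []PodGroup) *PodGang {
	pg := &PodGang{Name: name}
	for _, g := range groups {
		pg.PodGroups = append(pg.PodGroups, PodGroup{Name: g.Name, ExpectedPods: g.ExpectedPods})
	}
	return pg
}

// allReferencesPresent reports whether every group references all its pods.
func (pg *PodGang) allReferencesPresent() bool {
	for _, g := range pg.PodGroups {
		if len(g.PodReferences) < g.ExpectedPods {
			return false
		}
	}
	return true
}

// addPodReference records a created pod and sets Initialized once every
// group is fully populated.
func (pg *PodGang) addPodReference(group, pod string) {
	for i := range pg.PodGroups {
		if pg.PodGroups[i].Name == group {
			pg.PodGroups[i].PodReferences = append(pg.PodGroups[i].PodReferences, pod)
		}
	}
	pg.Initialized = pg.allReferencesPresent()
}

// canLiftSchedulingGate mirrors the gate check described for the PCLQ
// reconciler: pods stay gated until the PodGang is initialized.
func canLiftSchedulingGate(pg *PodGang) bool {
	return pg != nil && pg.Initialized
}

func main() {
	pg := newPodGang("pg-a", []PodGroup{{Name: "workers", ExpectedPods: 2}})
	fmt.Println(canLiftSchedulingGate(pg)) // PodGang exists, but no pods are referenced yet

	pg.addPodReference("workers", "pod-0")
	pg.addPodReference("workers", "pod-1")
	fmt.Println(canLiftSchedulingGate(pg)) // all references are filled in
}
```

Compared with the earlier flow, the PodGang now exists from the start, so a backend has something to react to, while the scheduling gate keeps pods from being scheduled against a half-populated PodGang.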

Which issue(s) this PR fixes:

Fixes #275
Fixes #445

Special notes for your reviewer:

Does this PR introduce an API change?

Yes. This PR introduces a new API, the SchedulerBackend interface:

type SchedulerBackend interface {
	// Name is a unique name of the scheduler backend.
	Name() string

	// Init provides a hook to initialize/setup one-time scheduler resources,
	// called at the startup of grove operator.
	Init() error

	// SyncPodGang synchronizes (creates/updates) scheduler specific resources for a PodGang
	// reacting to a creation or update of a PodGang resource.
	SyncPodGang(ctx context.Context, podGang *groveschedulerv1alpha1.PodGang) error

	// OnPodGangDelete cleans up scheduler specific resources for the given PodGang.
	OnPodGangDelete(ctx context.Context, podGang *groveschedulerv1alpha1.PodGang) error

	// PreparePod adds scheduler backend specific configuration to the given Pod object
	// prior to its creation. This includes setting schedulerName, scheduling gates,
	// annotations, etc.
	PreparePod(pod *corev1.Pod)
}

Additional documentation e.g., enhancement proposals, usage docs, etc.:


Comment thread operator/internal/controller/common/component/types.go Outdated
gflarity commented Jan 2, 2026

Contributor

Not sure I quite understand the goals of this; it's already possible to support different schedulers via the pod specs? (Though KAI is the only gang scheduler currently working.) I'd suggest kicking off work like this with a GitHub issue with plenty of detail and a Discord discussion as well.

@kangclzjc kangclzjc changed the title from "Add Scheduler Backend with KAI as default" to "Add Scheduler Backend framework" Jan 5, 2026
kangclzjc commented Jan 5, 2026

Contributor Author

Not sure I quite understand the goals of this; it's already possible to support different schedulers via the pod specs? (Though KAI is the only gang scheduler currently working.) I'd suggest kicking off work like this with a GitHub issue with plenty of detail and a Discord discussion as well.

Sure, let me create a new issue to introduce this. I can give some background here. This is a real request from one of our customers. We have some schedulers that want to integrate with Grove, so it would be great to have a unified scheduler backend. That way, we can support other schedulers easily. We need to support multiple schedulers as backends, and in particular we need to support the k8s 1.34 Workload API. Once we have this backend framework, we can easily add support for new schedulers such as the default kube-scheduler or Koordinator. This PR will only cover the scheduler backend framework. For the KAI scheduler backend, I won't change the current workflow, which means KAI will still handle the PodGang and create PodGroups/pods.

@kangclzjc kangclzjc force-pushed the scheduler_backend branch 2 times, most recently from 418038f to 206f953 Compare January 8, 2026 01:17
@kangclzjc kangclzjc marked this pull request as ready for review January 8, 2026 03:33

@Ronkahn21 Ronkahn21 left a comment

Contributor

Overall looks great! A few architectural points to consider:

  • Controller Responsibility: I don’t think the pcs-controller should be updating the PodGang status. Ideally, it should only handle the creation, leaving the podGang-controller to manage its own status.

  • Scaling & Performance: We should discuss the PodGang pod reference fields. Adding this to the pcs-controller increases its complexity. For better scalability, it might be better to let the PodGroup own the pod status before we move toward creating the backend API.

Since the API changes are currently out of scope, we can sync on this later. Amazing job overall, thanks!

Comment thread operator/internal/controller/podclique/components/pod/syncflow.go Outdated
Comment thread operator/internal/controller/podclique/register.go Outdated
Comment thread operator/internal/controller/podcliqueset/components/podgang/syncflow.go Outdated
Comment thread operator/internal/controller/podcliqueset/components/podgang/syncflow.go Outdated
Comment thread operator/internal/controller/podcliqueset/components/podgang/syncflow.go Outdated
Comment thread operator/internal/controller/podcliqueset/components/podgang/syncflow.go Outdated
Comment thread operator/internal/controller/podcliqueset/components/podgang/syncflow.go Outdated
Comment thread operator/internal/controller/podcliqueset/reconcilespec.go Outdated
Comment thread operator/internal/schedulerbackend/kai/backend.go Outdated
@kangclzjc

Contributor Author

Overall looks great! A few architectural points to consider:

  • Controller Responsibility: I don’t think the pcs-controller should be updating the PodGang status. Ideally, it should only handle the creation, leaving the podGang-controller to manage its own status.
  • Scaling & Performance: We should discuss the PodGang pod reference fields. Adding this to the pcs-controller increases its complexity. For better scalability, it might be better to let the PodGroup own the pod status before we move toward creating the backend API.

Since the API changes are currently out of scope, we can sync on this later. Amazing job overall, thanks!

  1. We don't currently have a PodGang controller, so do you mean adding a new podGang-controller?
  2. Actually, if we use the default kube-scheduler, we won't have a PodGroup, so it is better to use the pcs-controller to fill the pod reference fields.

@unmarshall unmarshall added the kind/api-change, kind/enhancement, component/scheduler, component/operator, and size/XL labels Jan 17, 2026
@kangclzjc kangclzjc force-pushed the scheduler_backend branch 3 times, most recently from aa4ca3b to b0b609c Compare January 29, 2026 03:12

@unmarshall unmarshall left a comment

Collaborator

1/n reviews

Comment thread operator/api/config/v1alpha1/types.go Outdated
Comment thread operator/api/config/v1alpha1/types.go Outdated
Comment thread operator/charts/templates/_helpers.tpl Outdated
Comment thread operator/charts/values.yaml Outdated
Comment thread operator/internal/controller/manager.go Outdated
@unmarshall

Collaborator

@kangclzjc please rebase your PR so that it becomes easier to review this PR.

@unmarshall unmarshall left a comment

Collaborator

2/n review comments

Comment thread operator/api/config/v1alpha1/types.go
Comment thread operator/api/config/v1alpha1/types.go
Comment thread operator/api/config/v1alpha1/types.go Outdated
Comment thread operator/api/config/v1alpha1/types.go Outdated
Comment thread operator/api/config/validation/validation.go Outdated
Comment thread operator/charts/values.yaml Outdated
Comment thread operator/charts/values.yaml Outdated
Comment thread operator/cmd/cli/testdata/valid-config-mnnvl-enabled.yaml Outdated
Comment thread operator/cmd/cli/testdata/valid-config.yaml Outdated
Comment thread operator/internal/controller/manager.go Outdated
@kangclzjc kangclzjc force-pushed the scheduler_backend branch 2 times, most recently from 5855c30 to 847e4dc Compare February 1, 2026 05:57
kangclzjc and others added 15 commits April 2, 2026 14:12
Signed-off-by: kangclzjc <kangz@nvidia.com>
Signed-off-by: kangclzjc <kangz@nvidia.com>
Signed-off-by: kangclzjc <kangz@nvidia.com>
Signed-off-by: kangclzjc <kangz@nvidia.com>
…tus correctly

Signed-off-by: kangclzjc <kangz@nvidia.com>
… name

Signed-off-by: kangclzjc <kangz@nvidia.com>
… keep all customer set

Signed-off-by: kangclzjc <kangz@nvidia.com>
Signed-off-by: kangclzjc <kangz@nvidia.com>
…alidation

Signed-off-by: kangclzjc <kangz@nvidia.com>
Signed-off-by: kangclzjc <kangz@nvidia.com>
* Changed the value of constant SchedulerNameKube to default-scheduler
* Removed the usage of kube-scheduler and replaced either with backend
  scheduler or default-scheduler based on the context.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
unmarshall and others added 6 commits April 2, 2026 14:45
the patch call is avoided if Initialized condition is already set to true

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
renamed SchedBackend interface to Backend
fixed linting errors

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
…verhead

Signed-off-by: kangclzjc <kangz@nvidia.com>
… functions

* Fixed validation for scheduler configuration.
* GetExistingPodGangs now uses list with label selector.
* syncContext used in PodGang component now does not embed a
  context.Context following golang recommendation.
* Corrected a potential nil pointer dereference in pcs validation webhook.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
unmarshall previously approved these changes Apr 3, 2026
Signed-off-by: kangclzjc <kangz@nvidia.com>
@unmarshall unmarshall merged commit b5621d4 into ai-dynamo:main Apr 3, 2026
14 checks passed
@danbar2 danbar2 mentioned this pull request Apr 7, 2026

Labels

  • component/operator: Issue/PR is for grove operator module
  • component/scheduler: Issue/PR is for scheduler module
  • kind/api-change: Categorizes issue or PR as related to adding, removing, or otherwise changing an API
  • kind/enhancement: Categorizes issue or PR as related to a new feature, enhancement or improvement
  • size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

  • Add scheduler backend framework
  • Add Native Support for Kubernetes Workload API to Enable Gang Scheduling

7 participants