Skip to content

fix: PodCliqueSet webhook crashes on unknown scheduler name#613

Merged
kangclzjc merged 1 commit into
ai-dynamo:mainfrom
weizhoublue:fix/pr-512
May 18, 2026
Merged

fix: PodCliqueSet webhook crashes on unknown scheduler name#613
kangclzjc merged 1 commit into
ai-dynamo:mainfrom
weizhoublue:fix/pr-512

Conversation

@weizhoublue

@weizhoublue weizhoublue commented May 14, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

In a PodCliqueSet YAML, users set spec.template.cliques[].spec.podSpec.schedulerName to choose the scheduler backend. Valid values are "default-scheduler" and "kai-scheduler".
This field is plain corev1.PodSpec.SchedulerName (a string) — it has no CRD-level enum constraint, so the API server does not reject unknown values like "volcano" at the schema layer.

If a user mistakenly sets this to an unsupported value or typo , the validating webhook runs validateSchedulerNames, which detects the mismatch and records a validation error. However, validateSchedulerNames only accumulates errors into a field.ErrorList; it does not return early or prevent the handler from continuing. validatePodCliqueSetWithBackend is called
unconditionally immediately after.

Inside that function, GetOrDefault("kaikai-scheduler") or GetOrDefault("volcano") returns nil because the name is non-empty but not in the registry, and the subsequent backend.ValidatePodCliqueSet() dereferences nil, panicking the webhook process before it can return the recorded validation error to the user.

In short: the webhook does validate the name, but that validation does not stop the code from reaching the nil-deref site.

Which issue(s) this PR fixes:

Fixes #619

Special notes for your reviewer:

Does this PR introduce a API change?


Additional documentation e.g., enhancement proposals, usage docs, etc.:


@copy-pr-bot

copy-pr-bot Bot commented May 14, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread operator/internal/controller/podcliqueset/components/podgang/podgang_test.go Outdated
shayasoolin
shayasoolin previously approved these changes May 14, 2026

@shayasoolin shayasoolin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

@sanjaychatterjee sanjaychatterjee left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kangclzjc can you please review this PR? Please ensure the validations are correct.

Comment thread operator/internal/webhook/admission/pcs/validation/handler.go Outdated
Comment thread operator/internal/controller/podcliqueset/components/podgang/podgang.go Outdated
@kangclzjc

Copy link
Copy Markdown
Contributor

@weizhoublue it's a good catch, I left some comments here. And could you also create an issue, and link it in Which issue(s) this PR fixes: here.

Signed-off-by: weizhoublue <weizhou.lan@daocloud.io>

fix: rename test and fix formatting for CI

fix

Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io>
@weizhoublue

Copy link
Copy Markdown
Contributor Author

@weizhoublue it's a good catch, I left some comments here. And could you also create an issue, and link it in Which issue(s) this PR fixes: here.

sure, #619 was created

@sanjaychatterjee sanjaychatterjee left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Thanks for the contribution @weizhoublue!

@kangclzjc kangclzjc merged commit 7fe6fb9 into ai-dynamo:main May 18, 2026
29 of 33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

webhook crashes on unknown scheduler name

4 participants