Skip to content

Introduce validations for TopologyConstraints in PodCliqueSet#317

Merged
unmarshall merged 4 commits into
ai-dynamo:mainfrom
unmarshall:tas-validations
Jan 14, 2026
Merged

Introduce validations for TopologyConstraints in PodCliqueSet#317
unmarshall merged 4 commits into
ai-dynamo:mainfrom
unmarshall:tas-validations

Conversation

@unmarshall

@unmarshall unmarshall commented Jan 14, 2026

Copy link
Copy Markdown
Collaborator

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds validations for TopologyConstraint when creating/updating PodCliqueSet resources. Existing validating webhook has been enhanced to additionally validate PCS resources for any violations as defined in #304.

Which issue(s) this PR fixes:

Fixes #304

Special notes for your reviewer:

Does this PR introduce a API change?

Enhance PodCliqueSet validating webhook to additionally check correctness of TopologyConstraints.

Additional documentation e.g., enhancement proposals, usage docs, etc.:


* Introduced validations for TopologyConstraints defined at different
  levels in PodCliqueSet.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
especially when you wish to recreate KAI topologies CR to reflect
changes in ClusterTopology CR. KAI topologies are immutable and
therefore needs to be deleted and then created afresh.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
@unmarshall unmarshall merged commit 258e47c into ai-dynamo:main Jan 14, 2026
3 of 6 checks passed

@renormalize renormalize left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: just saw the PR was merged as I was posting my comments. The comments can be addressed in a later PR I suppose.

)

// IsTopologyDomainNarrower returns true if the current TopologyDomain is narrower in scope than the other TopologyDomain.
func (d TopologyDomain) IsTopologyDomainNarrower(other TopologyDomain) bool {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method itself is defined on a topology domain. Including that again in the method name seems redundant and too verbose to me. The suggested name reads more like the stdlib functions.

Suggested change
func (d TopologyDomain) IsTopologyDomainNarrower(other TopologyDomain) bool {
func (d TopologyDomain) IsNarrower(other TopologyDomain) bool {

return fmt.Errorf("failed adding %s webhook handler: %v", defaulting.Name, err)
}
pcsValidatingWebhook := pcsvalidation.NewHandler(mgr)
pcsValidatingWebhook := pcsvalidation.NewHandler(mgr, tasConfig)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we're using slog for logging in this file. Let's remove this dependency and use the logger provided by the handler, like the rest of the codebase. We can also do this in a later clean-up if not now.

Comment on lines +394 to +402

// ---------------------------- Helper Functions ----------------------------

// getDefaultTASConfig returns a default TAS configuration with TAS disabled.
func getDefaultTASConfig() groveconfigv1alpha1.TopologyAwareSchedulingConfiguration {
return groveconfigv1alpha1.TopologyAwareSchedulingConfiguration{
Enabled: false,
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this function if we're only setting Enabled: false which is going to be the case anyway for TopologyAwareSchedulingConfiugration? The default is just a zero value struct instance. This was done in register_test.go.

Suggested change
// ---------------------------- Helper Functions ----------------------------
// getDefaultTASConfig returns a default TAS configuration with TAS disabled.
func getDefaultTASConfig() groveconfigv1alpha1.TopologyAwareSchedulingConfiguration {
return groveconfigv1alpha1.TopologyAwareSchedulingConfiguration{
Enabled: false,
}
}


// validatePodGangTemplateSpecUpdate validates updates to the template specification ensuring immutability of critical fields.
func validatePodGangTemplateSpecUpdate(newSpec, oldSpec *grovecorev1alpha1.PodCliqueSetTemplateSpec, fldPath *field.Path) field.ErrorList {
func (v *pcsValidator) validatePodGangTemplateSpecUpdate(oldPCS *grovecorev1alpha1.PodCliqueSet, fldPath *field.Path) field.ErrorList {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really have a PodGangTemplate anymore after #186. Shouldn't this be:

Suggested change
func (v *pcsValidator) validatePodGangTemplateSpecUpdate(oldPCS *grovecorev1alpha1.PodCliqueSet, fldPath *field.Path) field.ErrorList {
func (v *pcsValidator) validatePodCliqueSetTemplateSpecUpdate(oldPCS *grovecorev1alpha1.PodCliqueSet, fldPath *field.Path) field.ErrorList {

Comment on lines +894 to +901
// defaultTASConfig returns a default TAS configuration with TAS disabled.
// This is used for all podcliqueset validation tests since topology constraint
// validation is tested separately in topologyconstraints_v1_test.go.
func defaultTASConfig() groveconfigv1alpha1.TopologyAwareSchedulingConfiguration {
return groveconfigv1alpha1.TopologyAwareSchedulingConfiguration{
Enabled: false,
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +53 to +55
topologyDomains := lo.Map(tasConfig.Levels, func(level grovecorev1alpha1.TopologyLevel, _ int) string {
return string(level.Domain)
})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be done within newTopologyConstraintsValidator, as tasConfig is not used in any other validator.

newTopologyConstraintsValidator can accept a groveconfigv1alpha1.TopologyAwareSchedulingConfiguration instead of []string, and the constructor can map to []string.

pcsValidator would contain tasConfig as is.

}
}

// validate performs validation specific to PodCliqueSet creates.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is being called for both PCS create and updates. pcsValidator.validate() is being invoked in both handler.ValidateCreate() and handler.ValidateUpdate()

Can the comment be corrected?

Also, according to the doc string for disallowConstraintsForCreateWhenTASIsDisabled(), it only has to be invoked during creates, but at this moment it is being invoked for both create and update; this has to be corrected.

func buildClusterTopology(name string, topologyLevels []grovecorev1alpha1.TopologyLevel) *grovecorev1alpha1.ClusterTopology {
sortedTopologyLevels := make([]grovecorev1alpha1.TopologyLevel, len(topologyLevels))
copy(sortedTopologyLevels, topologyLevels)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my own education, why did you end up removing this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to know as well.

@gflarity gflarity Jan 15, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tagging @unmarshall for visibility

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add validations for topology constraints on PCS

4 participants