Reconcile PodClique TopologyConstraints#302
Conversation
1a0b687 to
8b60471
Compare
renormalize
left a comment
There was a problem hiding this comment.
1/n. I am going through the code currently, don't want to post all my comments prematurely since the rest of the code might fix my confusion.
renormalize
left a comment
There was a problem hiding this comment.
thanks for this PR! (and apologies for the slow review)
|
@unmarshall I see there's a timeout in the E2E tests. The tests can be tricky to debug so I made sure to include a lot of Debug logging. Take a look at setup.go in the E2E test dir. You can enable debug log level there: That should at least give an idea of why it's timing out. Also, I'm curious if it's timing out in the same spot, or it changes? |
|
When I run it locally it passes sometimes. At other times a random test fails. On GHA it often fails but at different places. |
gflarity
left a comment
There was a problem hiding this comment.
Mostly just suggestions except for the GPU Operator comments/depedencies.yaml comments.
Ronkahn21
left a comment
There was a problem hiding this comment.
One last point about removing the topology type condition when the feature is disabled
* Renamed ClusterTopologyConfiguration to TopologyAwareSchedulingConfiguration in operator config. * Introduced a new condition TopologyLevelsUnavailable on PCS * PackDomain field in corev1alpha1 TopologyConstraint is now required. * When creating ClusterTopology, if host topology level is not defined in TopologyAwareSchedulingConfiguration then the operator will set this level in ClusterTopology as this is a required level. * Adapted PodGang component to set pack constraints at all hierarchy levels. Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Introduced a new condition TopologyLevelsUnavailable in PCS status. * Added reconciliation code to update the PCS status condition. * Added missing cluster role to delete KAI topology CR. Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
…Gangs Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…uration Signed-off-by: Ron Kahn <rkahn@nvidia.com>
* Added code to update or remove the condition on PCS. * Create utility function for cluster topology with unit test Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Upgraded KAI scheduler version dependency for e2e test to v0.12.0 * Changed polling timeout for e2e tests to 2 mins. * Removed NVIDIA GPU operator to be installed as its not required. Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
…trol plane servers for e2e from 3 to 1 Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Moved `synchronizeTopology` in main to clustertopology package. * Adjusted unit tests for clustertopology.go * Removed the previously added delete cluster role for KAI Topology resource. * Removed the code to setup NVIDIA GPU Operator from e2e tests as its not required. * Increased the poll timeout to 4 mins. * Added restartPolicy to Always for Grove operator deployment. Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
will be set later after requirements are clear. * Added unit tests for computeExpectedPodGangs function. Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Moved GetClusterTopologyLevels to clustertopology package. * Added docstring for buildClusterTopology function. Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Added docstrings to computeTopologyLevelsUnavailableCondition and mutateTopologyLevelUnavailableConditions functions. * Changed the docstring for TopologyPackConstraint Preferred and Required fields. * Removed all references to GPU operator from e2e test. Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Removed tas and clustertopology samples. These will be added later. * Reworded docstring for PodGang TopologyConstraint. Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
…unctions Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
|
I have run the e2e tests locally and they are passing. I have spoken to @sanjaychatterjee and we have decided to merge this PR and not wait for run of e2e tests to complete on github.
|


What type of PR is this?
What this PR does / why we need it:
This PR introduces the following changes to ensure that the TopologyConstraints for topology aware scheduling defined on the PodCliqueSet are reconciled into PodGang.
TopologyAwareSchedulingConfiguration in operator config.
in TopologyAwareSchedulingConfiguration then the operator will set
this level in ClusterTopology as this is a required level.
Which issue(s) this PR fixes:
Fixes #246, #301
Special notes for your reviewer:
Does this PR introduce a API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: