Update conditions for a stale driver ds in the NVIDIADriver controller #1416
Merged
This commit fixes #1368.

In addition to verifying that DesiredNumberScheduled and NumberMisscheduled are 0 for a DaemonSet, we also verify that the DaemonSet's nodeSelector does not match any nodes before deciding the DaemonSet is stale and deleting it.

Without this change, it is possible for the NVIDIADriver controller to enter an endless loop of creating and deleting a DaemonSet. This occurs when 1) the NVIDIADriver DaemonSet does not tolerate a taint present on all nodes matching its configured nodeSelector, AND 2) none of the DaemonSet pods have been scheduled yet.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
tariq1890 approved these changes on Apr 24, 2025
jgehrcke reviewed on Apr 25, 2025
// #3 was added in response to https://github.com/NVIDIA/gpu-operator/issues/1368 where the NVIDIADriver controller
// entered an endless loop of creating and deleting a DaemonSet. The DaemonSet's nodeSelector matched one or more nodes,
// but DesiredNumberScheduled and NumberMisscheduled were both 0 because the DaemonSet did not tolerate a taint on all
// the nodes.
Thanks for the precise and comprehensive commentary. This will save time down the line.
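For readers skimming the thread, the staleness check this PR describes can be sketched roughly as follows. This is a minimal illustration, not the controller's actual code: `daemonSetStatus`, `selectorMatches`, `isStale`, and the label key `example.com/gpu` are hypothetical names, and the real controller presumably works with the Kubernetes API types and a node list from the cluster rather than plain maps.

```go
package main

import "fmt"

// daemonSetStatus is a hypothetical stand-in for the two DaemonSet status
// counters named in the PR description.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	NumberMisscheduled     int32
}

// selectorMatches reports whether every key/value pair in selector is
// present in labels. An empty selector matches any node.
func selectorMatches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// isStale returns true only when all three conditions hold:
//  1. DesiredNumberScheduled is 0,
//  2. NumberMisscheduled is 0,
//  3. the nodeSelector matches none of the given nodes.
func isStale(status daemonSetStatus, nodeSelector map[string]string, nodeLabels []map[string]string) bool {
	if status.DesiredNumberScheduled != 0 || status.NumberMisscheduled != 0 {
		return false
	}
	for _, labels := range nodeLabels {
		if selectorMatches(nodeSelector, labels) {
			return false
		}
	}
	return true
}

func main() {
	status := daemonSetStatus{} // both counters are 0
	selector := map[string]string{"example.com/gpu": "true"}

	// A node still matches the selector (for example, it is tainted and
	// untolerated), so the DaemonSet is not considered stale.
	matching := []map[string]string{{"example.com/gpu": "true"}}
	fmt.Println("stale with matching node:", isStale(status, selector, matching))

	// No node matches the selector, so the DaemonSet is considered stale.
	other := []map[string]string{{"example.com/role": "cpu"}}
	fmt.Println("stale without matching node:", isStale(status, selector, other))
}
```

With only the first two conditions, the first case above would be treated as stale, which is the create/delete loop reported in #1368.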
Testing
Reproduce the bug:
1. Taint all of your nodes.
2. Install GPU Operator and tolerate the taint so that all components can be scheduled. Enable the NVIDIADriver CRD but disable the default CR creation.
3. Deploy the sample NVIDIADriver CR, which does not tolerate the taint.

You will observe a driver DaemonSet repeatedly being created and deleted in the gpu-operator namespace. The gpu-operator logs will also confirm this.

Verify the fix:
Follow the same steps used to reproduce the bug and verify that a driver DaemonSet is created for the NVIDIADriver CR and is not repeatedly deleted. The DesiredNumberScheduled will remain 0 for the DaemonSet unless you remove the taint from the node(s) or add the toleration to the NVIDIADriver CR.