Update conditions for a stale driver ds in the NVIDIADriver controller#1416

Merged: tariq1890 merged 1 commit into main from fix-nvd-stale-ds-check on May 5, 2025
@cdesiniotis cdesiniotis commented Apr 24, 2025

This commit fixes #1368.

In addition to verifying that DesiredNumberScheduled and NumberMisscheduled are 0 for a DaemonSet, we also verify that the DaemonSet's nodeSelector does not match any nodes before deciding the DaemonSet is stale and deleting it.

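Without this change, the NVIDIADriver controller can enter an endless loop of creating and deleting a DaemonSet. This occurs when (1) the NVIDIADriver DaemonSet does not tolerate a taint present on all nodes matching its configured nodeSelector, and (2) none of the DaemonSet pods have been scheduled yet. In that state DesiredNumberScheduled and NumberMisscheduled are both 0 even though the nodeSelector still matches nodes, so checking the two counters alone wrongly marks the DaemonSet as stale.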
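The three-part staleness check described above can be sketched roughly as follows. This is a minimal illustration, not the code in `internal/state/driver.go`: the `Node` and `DaemonSetInfo` types and the `selectorMatches` and `isStale` helpers are hypothetical stand-ins that use plain maps in place of the Kubernetes API types.

```go
package main

// Node is a hypothetical, simplified stand-in for a Kubernetes node.
// Only the labels matter for nodeSelector matching.
type Node struct {
	Name   string
	Labels map[string]string
}

// DaemonSetInfo is a hypothetical, simplified view of the DaemonSet
// fields that the staleness check looks at.
type DaemonSetInfo struct {
	DesiredNumberScheduled int32
	NumberMisscheduled     int32
	NodeSelector           map[string]string
}

// selectorMatches reports whether every key/value pair in selector is
// present in labels. An empty selector matches every node.
func selectorMatches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// isStale returns true only when all three conditions hold:
//  1. DesiredNumberScheduled is 0
//  2. NumberMisscheduled is 0
//  3. the nodeSelector matches no node in the cluster
func isStale(ds DaemonSetInfo, nodes []Node) bool {
	if ds.DesiredNumberScheduled != 0 || ds.NumberMisscheduled != 0 {
		return false
	}
	for _, n := range nodes {
		if selectorMatches(ds.NodeSelector, n.Labels) {
			// A node still matches, so the DaemonSet may simply be
			// blocked by an untolerated taint. Do not delete it.
			return false
		}
	}
	return true
}
```

With only conditions 1 and 2, a DaemonSet blocked by an untolerated taint looks identical to one whose target nodes no longer exist. Condition 3 is what separates the two cases.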

Testing

Reproduce the bug:

Taint all of your nodes:

$ kubectl taint nodes <node-name> foo=bar:NoSchedule

Install GPU Operator and tolerate the taint so that all components can be scheduled. Enable the NVIDIADriver CRD but disable the default CR creation.

$ cat values.yaml 
daemonsets:
  tolerations:
    - key: "foo"
      operator: "Exists"
      effect: "NoSchedule"

operator:
  tolerations:
    - key: "foo"
      operator: "Exists"
      effect: "NoSchedule"

node-feature-discovery:
  worker:
    tolerations:
      - key: "foo"
        operator: "Exists"
        effect: "NoSchedule"
  master:
    tolerations:
      - key: "foo"
        operator: "Exists"
        effect: "NoSchedule"
  gc:
    tolerations:
      - key: "foo"
        operator: "Exists"
        effect: "NoSchedule"

driver:
  nvidiaDriverCRD:
    enabled: true
    deployDefaultCR: false

$ helm install --generate-name nvidia/gpu-operator -n gpu-operator -f values.yaml

Deploy the sample NVIDIADriver CR which does not tolerate the taint:

$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/refs/heads/main/config/samples/nvidia_v1alpha1_nvidiadriver.yaml

You will observe a driver DaemonSet repeatedly being created and deleted in the gpu-operator namespace. The gpu-operator logs will also confirm this.

Verify the fix:

Follow the same reproduction steps and verify that a driver DaemonSet corresponding to the NVIDIADriver CR is created and is not repeatedly deleted. DesiredNumberScheduled remains 0 for the DaemonSet until you remove the taint from the node(s) or add the toleration to the NVIDIADriver CR.
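To see why DesiredNumberScheduled stays at 0 while the `foo=bar:NoSchedule` taint is untolerated, here is a rough sketch of taint/toleration matching. It is a simplified model of the Kubernetes rules, not the scheduler's or the operator's code. The `Taint` and `Toleration` structs and the `tolerates` and `schedulable` helpers are hypothetical and use plain strings rather than the `k8s.io/api` types.

```go
package main

// Taint and Toleration are hypothetical, simplified stand-ins for the
// corresponding Kubernetes core/v1 types.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether a single toleration covers a single taint.
// An empty Effect or Key on the toleration acts as a wildcard, and an
// empty Operator is treated as "Equal".
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Value == t.Value
	}
	return false
}

// schedulable reports whether a pod with the given tolerations may be
// placed on a node with the given taints. PreferNoSchedule taints are a
// soft preference, so they are skipped here.
func schedulable(tols []Toleration, taints []Taint) bool {
	for _, t := range taints {
		if t.Effect != "NoSchedule" && t.Effect != "NoExecute" {
			continue
		}
		tolerated := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}
```

In this model, the sample NVIDIADriver CR (no tolerations) is not schedulable on a node tainted `foo=bar:NoSchedule`, while a toleration with key `foo`, operator `Exists`, and effect `NoSchedule`, as used in the `values.yaml` above, makes it schedulable.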

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>

@shivamerla shivamerla left a comment

LGTM

Comment thread internal/state/driver.go
// #3 was added in response to https://github.com/NVIDIA/gpu-operator/issues/1368 where the NVIDIADriver controller
// entered an endless loop of creating and deleting a DaemonSet. The DaemonSet's nodeSelector matched one or more nodes,
// but DesiredNumberScheduled and NumberMisscheduled were both 0 because the DaemonSet did not tolerate a taint on all
// the nodes.

Thanks for the precise and comprehensive commentary. This will save time down the line.

@tariq1890 tariq1890 merged commit 1ad5393 into main May 5, 2025
@tariq1890 tariq1890 deleted the fix-nvd-stale-ds-check branch May 5, 2025 22:24
Successfully merging this pull request may close these issues.

NVIDIADriver CRD: Endless Termination Cycle of NVIDIA Driver Pods
