
[release-25.10] Run nvidia-smi after modules are loaded in driver ds startup probe #1943

Merged
cdesiniotis merged 1 commit into release-25.10 from backport-1939-to-release-25.10 on Nov 26, 2025


@github-actions

Contributor

🤖 Automated backport of #1939 to release-25.10

✅ Cherry-pick completed successfully with no conflicts.

Original PR: #1939
Original Author: @cdesiniotis

Cherry-picked commits (1):

  • 90e757a Run nvidia-smi after modules are loaded in driver ds startup probe

This backport was automatically created by the backport bot.

This commit eliminates the race condition where the startup probe
in the driver daemonset runs after the kernel modules are built
(and installed) but before the modules are loaded into the kernel.
In this case, the invocation of nvidia-smi (by the startup probe)
is what is actually loading the nvidia kernel module and not the
modprobe we perform in our driver container scripts. As a result,
the nvidia driver will be loaded with a default configuration --
none of the custom kernel module parameters provided by users
(via a configmap) or set by our driver container will get applied.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
(cherry picked from commit 90e757a)
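
The ordering described in the commit message can be sketched as follows. This is a minimal illustration in Python, not the change itself: the function names `module_loaded` and `startup_probe` are hypothetical, and reading `/proc/modules` is an assumed way of detecting a loaded module. The actual probe in the driver daemonset may use a different check.

```python
import subprocess
from pathlib import Path


def module_loaded(name: str, proc_modules: str) -> bool:
    """Return True if `name` is the first field of any line in /proc/modules text."""
    for line in proc_modules.splitlines():
        fields = line.split()
        # The first field of each /proc/modules line is the module name.
        if fields and fields[0] == name:
            return True
    return False


def startup_probe(proc_modules_path: str = "/proc/modules") -> int:
    """Hypothetical probe: return 0 on success, 1 on failure.

    nvidia-smi is only invoked once the nvidia module is already loaded,
    so the probe itself never triggers a module load with default parameters.
    """
    text = Path(proc_modules_path).read_text()
    if not module_loaded("nvidia", text):
        # Fail early and let the probe retry later; the driver container's
        # own modprobe (with the user-supplied parameters) has not run yet.
        return 1
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True)
    except OSError:
        # nvidia-smi is missing or not executable.
        return 1
    return 0 if result.returncode == 0 else 1
```

A probe command could then exit with the value returned by `startup_probe()`. The point of the guard is only the ordering: the module check comes first, and `nvidia-smi` runs second.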
@copy-pr-bot

copy-pr-bot Bot commented Nov 26, 2025


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cdesiniotis

Contributor

/ok to test ea6ecd5

@coveralls


Coverage Status

coverage: 22.895% (remained the same) when pulling ea6ecd5 on backport-1939-to-release-25.10 into fb0ae21 on release-25.10.

@cdesiniotis cdesiniotis merged commit f55e7fb into release-25.10 Nov 26, 2025
17 checks passed
@tariq1890 tariq1890 deleted the backport-1939-to-release-25.10 branch November 26, 2025 21:00

3 participants