Run nvidia-smi after modules are loaded in driver ds startup probe #1939

Merged
cdesiniotis merged 1 commit into NVIDIA:main from cdesiniotis:fix-driver-startup-probe
Nov 26, 2025

@cdesiniotis

Contributor

This commit eliminates the race condition where the startup probe in the driver daemonset runs after the kernel modules are built (and installed) but before the modules are loaded into the kernel. In this case, the invocation of nvidia-smi (by the startup probe) is what is actually loading the nvidia kernel module and not the modprobe we perform in our driver container scripts. As a result, the nvidia driver will be loaded with a default configuration -- none of the custom kernel module parameters provided by users (via a configmap) or set by our driver container will get applied.

@tariq1890

Contributor

I've created #1940 to resolve the build failure in the PR pipeline

@karthikvetrivel karthikvetrivel left a comment

Member

LGTM, Chris. I assume you've had the chance to test it under the same circumstances as the original bug?

@shivamerla

Contributor

Good catch! Do we need to add similar checks wherever we run driver validation (e.g., in the driver-validation init-container of the toolkit)?

This commit eliminates the race condition where the startup probe
in the driver daemonset runs after the kernel modules are built
(and installed) but before the modules are loaded into the kernel.
In this case, the invocation of nvidia-smi (by the startup probe)
is what is actually loading the nvidia kernel module and not the
modprobe we perform in our driver container scripts. As a result,
the nvidia driver will be loaded with a default configuration --
none of the custom kernel module parameters provided by users
(via a configmap) or set by our driver container will get applied.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
@cdesiniotis cdesiniotis force-pushed the fix-driver-startup-probe branch from d8d2842 to 90e757a Compare November 26, 2025 16:39
@cdesiniotis

Contributor Author

Do we need to add similar checks wherever we run driver validation (e.g., in the driver-validation init-container of the toolkit)?

@shivamerla Good question. The driver-validation currently waits on the presence of /run/nvidia/validations/.driver-ctr-ready before running nvidia-smi:

// For driver container installs, check existence of .driver-ctr-ready to confirm running driver
// container has completed and is in Ready state.
func assertDriverContainerReady(silent bool) error {
	command := shell
	args := []string{"-c", "stat /run/nvidia/validations/.driver-ctr-ready"}
	if withWaitFlag {
		return runCommandWithWait(command, args, sleepIntervalSecondsFlag, silent)
	}
	return runCommand(command, args, silent)
}

Since the .driver-ctr-ready status file is created in the driver pod's startup probe, this race condition does not exist in this case and no changes are required.

@cdesiniotis

Contributor Author

I assume you've had the chance to test it under the same circumstances as the original bug?

@karthikvetrivel Not yet, but based on my analysis this change is needed regardless. This is definitely a race condition that has always existed. Now whether this change actually resolves the reported issue won't be entirely confirmed until we get access to their environment.

@cdesiniotis

Contributor Author

/cherry-pick release-25.10

@cdesiniotis cdesiniotis merged commit 51dd7a2 into NVIDIA:main Nov 26, 2025
16 checks passed
@github-actions

Contributor

🤖 Backport PR created for release-25.10: #1943

5 participants