Run nvidia-smi after modules are loaded in driver ds startup probe #1939
I've created #1940 to resolve the build failure in the PR pipeline.
karthikvetrivel left a comment
LGTM, Chris. I assume you've had the chance to test it under the same circumstances as the original bug?
Good catch! Do we need to add similar checks wherever we are running driver validation (i.e., as in the driver-validation init container of the toolkit)?
This commit eliminates the race condition where the startup probe in the driver daemonset runs after the kernel modules are built (and installed) but before the modules are loaded into the kernel. In this case, the invocation of nvidia-smi (by the startup probe) is what is actually loading the nvidia kernel module and not the modprobe we perform in our driver container scripts. As a result, the nvidia driver will be loaded with a default configuration -- none of the custom kernel module parameters provided by users (via a configmap) or set by our driver container will get applied.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
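To illustrate the ordering the commit message describes, here is a minimal shell sketch of a probe that refuses to call nvidia-smi until the nvidia module is already loaded. It is not the script from this PR: the function names, the `SYSFS_MODULE_DIR` and `NVIDIA_SMI` overrides, and the use of `/sys/module/nvidia` as the "module is loaded" check are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: names and paths are illustrative assumptions, not the
# actual startup probe shipped with the driver daemonset.

# Hypothetical overrides so the logic can be exercised without a GPU.
SYSFS_MODULE_DIR="${SYSFS_MODULE_DIR:-/sys/module}"
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# Returns 0 if the nvidia kernel module is already loaded.
# The kernel exposes a directory under /sys/module for a loaded module.
nvidia_module_loaded() {
    [ -d "${SYSFS_MODULE_DIR}/nvidia" ]
}

# Fail without touching nvidia-smi until the module is loaded, so that
# nvidia-smi never becomes the thing that loads the module (which would
# skip the custom kernel module parameters).
startup_probe() {
    if ! nvidia_module_loaded; then
        echo "nvidia module not loaded yet; skipping nvidia-smi" >&2
        return 1
    fi
    "${NVIDIA_SMI}"
}
```

A probe script built this way would end with a call to `startup_probe`. A non-zero exit simply makes the kubelet retry the probe later, which leaves time for the driver container's own modprobe to run first.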
d8d2842 to 90e757a
@shivamerla Good question. See gpu-operator/cmd/nvidia-validator/main.go, lines 667 to 678 in 526cc24. Since the .driver-ctr-ready status file is created in the driver pod's startup probe, this race condition does not exist in this case and no changes are required.
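For readers unfamiliar with this pattern, the idea of gating on a status file can be sketched in shell as below. This assumes, as the comment implies, that the validator waits for `.driver-ctr-ready` before doing anything else. The real logic lives in the Go code referenced above; the function name, its arguments, and the timeout handling here are hypothetical.

```shell
#!/bin/sh
# Sketch only: the real check is implemented in Go in nvidia-validator.
# The function name and arguments are illustrative assumptions.

# Wait until <status_dir>/.driver-ctr-ready exists, or give up after
# <timeout_s> seconds. Returns 0 when the file is present, 1 on timeout.
wait_for_driver_ready() {
    status_dir="$1"   # directory where the status file is expected
    timeout_s="$2"    # maximum number of seconds to wait
    waited=0
    while [ ! -f "${status_dir}/.driver-ctr-ready" ]; do
        if [ "${waited}" -ge "${timeout_s}" ]; then
            echo "timed out waiting for .driver-ctr-ready" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Because the file only appears once the driver pod's startup probe has run, a consumer that waits on it this way cannot invoke nvidia-smi before the driver container has loaded the modules itself.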
@karthikvetrivel Not yet, but based on my analysis this change is needed regardless. This is definitely a race condition that has always existed. Now whether this change actually resolves the reported issue won't be entirely confirmed until we get access to their environment.
/cherry-pick release-25.10
🤖 Backport PR created for