fix(validator): remove hostNetwork and privileged from GKE NCCL runtime, use NRI device injection (#427)
yuanchen8911 left a comment
Review
Good direction — aligning the validator runtime with the NRI profile. A few issues to address:
Critical
- Missing `/sys` and `/proc/sys` hostPath mounts on tcpxo-daemon — The runtime removes `hostNetwork: true` but doesn't add the `/sys` → `/hostsysfs` and `/proc/sys` → `/hostprocsysfs` volume mounts. Without these, RxDM will detect 0/8 GPUs — the container namespace hides the PCI sysfs tree. These mounts are what make the NRI profile work without hostNetwork. See `demos/workloads/training/gke-nccl-test-tcpxo.yaml` in PR #420 for the working config.
- RxDM v1.0.21 with incompatible flags — The entrypoint uses `--uid= --alsologtostderr`, but v1.0.21 ships the older `fastrak_gpumem_manager` binary, which reports errors like `ERROR: Unknown command line flag 'enforce_kernel_ipv6_support'`. We validated that v1.0.20 works with simplified args (`--num_hops=2 --num_nics 8`). Note: v1.0.21 is also marked as deprecated in the container registry. Recommend updating to `tcpgpudmarxd-dev:v1.0.20`.
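The missing mounts could be wired up roughly as follows. This is a sketch only — the volume names and container layout are assumptions; the mount paths come from the review comment, and the authoritative working config is `demos/workloads/training/gke-nccl-test-tcpxo.yaml` in PR #420.

```yaml
# Sketch: hostPath mounts that expose the node's PCI sysfs tree to RxDM
# without hostNetwork. Volume names are illustrative.
volumes:
  - name: sysfs
    hostPath:
      path: /sys
  - name: proc-sysfs
    hostPath:
      path: /proc/sys
containers:
  - name: tcpxo-daemon
    image: tcpgpudmarxd-dev:v1.0.20   # v1.0.21 is deprecated; see review note
    volumeMounts:
      - name: sysfs
        mountPath: /hostsysfs
      - name: proc-sysfs
        mountPath: /hostprocsysfs
```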
Medium
- Capability names should omit the `CAP_` prefix — The runtime uses `CAP_NET_ADMIN`, `CAP_NET_BIND_SERVICE`, but the Kubernetes spec expects names without the `CAP_` prefix (`NET_ADMIN`, `NET_BIND_SERVICE`, `IPC_LOCK`). GKE's containerd accepts both, but it's not portable and could fail on stricter runtimes or admission controllers.
- GPU node pool prerequisite — Worth noting in the PR description or runtime comments that the GPU node pool must not have a gVNIC additional network. A gVNIC takes PCI `0000:06:00.0`, displacing one of the 8 GPU NICs and causing a 7/8 GPU detection failure. Root-cause details in #381.
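The capability rename is a small change in each container's `securityContext`. A sketch of the portable form (container placement assumed from the review's description):

```yaml
# tcpxo-daemon: Kubernetes capability names, no CAP_ prefix
securityContext:
  capabilities:
    add: ["NET_ADMIN", "NET_BIND_SERVICE"]
---
# worker: IPC_LOCK instead of privileged: true
securityContext:
  capabilities:
    add: ["IPC_LOCK"]
```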
Looks good
- NRI device annotation generation with dynamic GPU count + tests
- SSH port change from 2222 to default (correct for non-hostNetwork)
- MPI OOB interface comment update (eth0 is control NIC in pod networking)
- Worker capability `IPC_LOCK` instead of privileged
These are already present in the runtime (lines 180-183):
Downgraded to v1.0.20 and removed the `--uid= --alsologtostderr` flags. Reverted entrypoint args to `--num_hops=2 --num_nics ${GPU_COUNT_PER_NODE}`.
Fixed — changed to `NET_ADMIN`, `NET_BIND_SERVICE`, `IPC_LOCK`.
Added a comment in the runtime header noting the gVNIC node-pool prerequisite.
Fixed prealloc lint — preallocated the `lines` slice with `make([]string, 0, gpuCount+3)`.
fix(validator): remove hostNetwork and privileged from GKE NCCL runtime, use NRI device injection (NVIDIA#427)
Summary
- Removed `hostNetwork: true` and `privileged: true` from the GKE NCCL TrainingRuntime, matching the Excalibur non-privileged configuration
- NRI device injection (`devices.gke.io` annotation) for tcpxo-daemon GPU/DMA device access instead of privileged mode
- `CAP_NET_ADMIN` + `CAP_NET_BIND_SERVICE` for tcpxo-daemon, `IPC_LOCK` for worker
- Control NIC `eth0`, GPU NICs `eth1`-`eth8` (previously `eth1`/`eth2`-`eth9` with hostNetwork)
- Device annotation generated via `buildNRIDeviceAnnotation()` instead of hardcoding 8 devices
- SSH: removed custom port (`-p 2222`), use default port 22
- RxDM entrypoint flags: `--num_nics=8 --uid= --alsologtostderr`
Motivation
The previous runtime used `hostNetwork: true` and `privileged: true`, which caused issues on GKE clusters with extra gVNIC Network CRs. The gVNIC network interfered with multi-NIC pod injection, causing gpu-nic-7 to be missing (only 7 of 8 GPU NICs injected). Moving to pod networking with NRI device injection matches the Excalibur reference configuration and avoids the gVNIC interference while also improving security posture.
How it works
- NRI injects `/dev/nvidia*`, `/dev/nvidiactl`, `/dev/nvidia-uvm`, and `/dev/dmabuf_import_helper` into tcpxo-daemon
- Device list scales with `GPU_COUNT_PER_NODE`
- SSH uses the default port (no `-p 2222` remapping needed without hostNetwork)
- NCCL environment from `nccl-env-profile.sh` (unchanged)
Test plan
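Per the summary, device access is granted through a `devices.gke.io` pod annotation rather than privileged mode. A sketch of what the generated annotation might carry — the annotation key format and value schema are assumptions, as the PR text only names the `devices.gke.io` prefix and the device paths:

```yaml
# Sketch: NRI device-injection annotation on the pod (key format assumed)
metadata:
  annotations:
    devices.gke.io/container.tcpxo-daemon: |
      - path: /dev/nvidia0
      # ... one entry per GPU, up to GPU_COUNT_PER_NODE
      - path: /dev/nvidiactl
      - path: /dev/nvidia-uvm
      - path: /dev/dmabuf_import_helper
```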