
fix(validator): remove hostNetwork and privileged from GKE NCCL runtime, use NRI device injection#427

Merged
xdu31 merged 3 commits into NVIDIA:main from xdu31:fix/gke-non-hostnetwork-nccl on Mar 18, 2026

Conversation

Contributor

@xdu31 xdu31 commented Mar 18, 2026

Summary

  • Remove hostNetwork: true and privileged: true from GKE NCCL
    TrainingRuntime, matching Excalibur non-privileged configuration
  • Use NRI device injection (devices.gke.io annotation) for tcpxo-daemon
    GPU/DMA device access instead of privileged mode
  • Replace privileged security contexts with minimal capabilities:
    CAP_NET_ADMIN+CAP_NET_BIND_SERVICE for tcpxo-daemon, IPC_LOCK for worker
  • Shift NIC names for pod networking: eth0 is the control NIC and eth1-eth8
    are the GPU NICs (previously eth1 control / eth2-eth9 GPU with hostNetwork)
  • Generate NRI device annotation dynamically from GPU count per node
    via buildNRIDeviceAnnotation() instead of hardcoding 8 devices
  • Remove custom SSH port (-p 2222), use default port 22
  • Update rxdm entrypoint args: --num_nics=8 --uid= --alsologtostderr

Motivation

The previous runtime used hostNetwork: true and privileged: true which
caused issues on GKE clusters with extra gVNIC Network CRs. The gVNIC
network interfered with multi-NIC pod injection, causing gpu-nic-7 to be
missing (only 7 of 8 GPU NICs injected). Moving to pod networking with NRI
device injection matches the Excalibur reference configuration and avoids
the gVNIC interference while also improving security posture.

How it works

  1. Worker pods use pod networking (no hostNetwork) with multi-NIC annotation
  2. NRI device injector exposes /dev/nvidia*, /dev/nvidiactl,
    /dev/nvidia-uvm, and /dev/dmabuf_import_helper to tcpxo-daemon
  3. Device list is generated dynamically from GPU_COUNT_PER_NODE
  4. SSH uses default port 22 (no -p 2222 remapping needed without hostNetwork)
  5. NCCL env vars sourced from host nccl-env-profile.sh (unchanged)

Test plan

  • Run NCCL All-Reduce validation on GKE H100 cluster (a3-megagpu-8g)
  • Verify all 8 GPU NICs are injected into worker pods
  • Verify bandwidth ~335 GB/s at 8 GB message size (2 nodes)
  • Verify tcpxo-daemon runs without privileged mode
  • Verify no guest config checker mismatches in launcher logs

@xdu31 xdu31 requested a review from a team as a code owner March 18, 2026 21:15
@xdu31 xdu31 force-pushed the fix/gke-non-hostnetwork-nccl branch from e1dd2ad to c3d9bb5 Compare March 18, 2026 21:18
Contributor

@yuanchen8911 yuanchen8911 left a comment

Review

Good direction — aligning the validator runtime with the NRI profile. A few issues to address:

Critical

  1. Missing /sys and /proc/sys hostPath mounts on tcpxo-daemon — The runtime removes hostNetwork: true but doesn't add the /sys/hostsysfs and /proc/sys/hostprocsysfs volume mounts. Without these, RxDM will detect 0/8 GPUs — the container namespace hides the PCI sysfs tree. These mounts are what makes the NRI profile work without hostNetwork. See demos/workloads/training/gke-nccl-test-tcpxo.yaml in PR #420 for the working config.

  2. RxDM v1.0.21 with incompatible flags — The entrypoint uses --uid= --alsologtostderr but v1.0.21 uses the older fastrak_gpumem_manager binary which reports ERROR: Unknown command line flag 'enforce_kernel_ipv6_support' style errors. We validated v1.0.20 works with simplified args (--num_hops=2 --num_nics 8). Note: v1.0.21 is also marked as deprecated in the container registry. Recommend updating to tcpgpudmarxd-dev:v1.0.20.

Medium

  1. Capability names should omit CAP_ prefix — The runtime uses CAP_NET_ADMIN, CAP_NET_BIND_SERVICE but the Kubernetes spec expects names without the CAP_ prefix (NET_ADMIN, NET_BIND_SERVICE, IPC_LOCK). GKE's containerd accepts both, but it's not portable and could fail on stricter runtimes or admission controllers.

  2. GPU node pool prerequisite — Worth noting in the PR description or runtime comments that the GPU node pool must not have a gVNIC additional network. A gVNIC takes PCI 0000:06:00.0, displacing one of the 8 GPU NICs and causing a 7/8 GPU detection failure. Root cause details in #381.
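The capability fix in point 1 would look like the following sketch (container names and surrounding structure are hypothetical; the capability values are taken from the review):

```yaml
# Sketch: non-privileged security contexts with unprefixed capability names.
containers:
  - name: tcpxo-daemon
    securityContext:
      privileged: false
      capabilities:
        add: ["NET_ADMIN", "NET_BIND_SERVICE"]  # not CAP_NET_ADMIN etc.
  - name: worker
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
```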

Looks good

  • NRI device annotation generation with dynamic GPU count + tests
  • SSH port change from 2222 to default (correct for non-hostNetwork)
  • MPI OOB interface comment update (eth0 is control NIC in pod networking)
  • Worker capability IPC_LOCK instead of privileged

Contributor

@yuanchen8911 yuanchen8911 left a comment

Left some comments

Contributor Author

xdu31 commented Mar 18, 2026

  1. Missing /sys and /proc/sys mounts

These are already present in the runtime (lines 180-183):

  • name: nvtcpxo-sys
    mountPath: /hostsysfs
  • name: nvtcpxo-proc-sys
    mountPath: /hostprocsysfs

Both are backed by hostPath volumes at the bottom of the spec.

  2. RxDM v1.0.21 → v1.0.20

Downgraded to v1.0.20 and removed the --uid= --alsologtostderr flags. Reverted the entrypoint args to --num_hops=2 --num_nics ${GPU_COUNT_PER_NODE}.

  3. Drop CAP_ prefix

Fixed — changed to NET_ADMIN, NET_BIND_SERVICE, IPC_LOCK.

  4. gVNIC prerequisite

Added a comment in the runtime header:

  # Prerequisite: GPU node pool must NOT have a gVNIC additional network.
  # A gVNIC network takes a PCI slot, displacing one GPU NIC (7/8 detection).
  # See https://github.com/NVIDIA/aicr/issues/381 for details.

  5. Lint (CI)

Fixed the prealloc lint — preallocated the lines slice with make([]string, 0, gpuCount+3).
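For reference, the mounts in point 1 would be backed by hostPath volumes along these lines (a sketch; the volume names appear in the runtime, and the /sys and /proc/sys host paths come from the review comment):

```yaml
# Sketch: hostPath volumes exposing the host sysfs tree to tcpxo-daemon,
# so RxDM can detect GPUs without hostNetwork.
volumes:
  - name: nvtcpxo-sys
    hostPath:
      path: /sys
  - name: nvtcpxo-proc-sys
    hostPath:
      path: /proc/sys
```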

@xdu31 xdu31 requested a review from yuanchen8911 March 18, 2026 22:00
Contributor

@yuanchen8911 yuanchen8911 left a comment

/lgtm

@xdu31 xdu31 merged commit d99235e into NVIDIA:main Mar 18, 2026
18 checks passed
xdu31 added a commit to xdu31/aicr that referenced this pull request Mar 24, 2026