
fix(validator): remove hostNetwork and privileged from GKE NCCL runtime, use NRI device injection#427

Merged
xdu31 merged 3 commits into NVIDIA:main from xdu31:fix/gke-non-hostnetwork-nccl on Mar 18, 2026

Conversation

Contributor

@xdu31 xdu31 commented Mar 18, 2026

Summary

  • Remove hostNetwork: true and privileged: true from GKE NCCL
    TrainingRuntime, matching Excalibur non-privileged configuration
  • Use NRI device injection (devices.gke.io annotation) for tcpxo-daemon
    GPU/DMA device access instead of privileged mode
  • Replace privileged security contexts with minimal capabilities:
    CAP_NET_ADMIN+CAP_NET_BIND_SERVICE for tcpxo-daemon, IPC_LOCK for worker
  • Shift NIC names for pod networking: eth0 is the control NIC and eth1-eth8
    are the GPU NICs (previously eth1 control / eth2-eth9 GPU with hostNetwork)
  • Generate NRI device annotation dynamically from GPU count per node
    via buildNRIDeviceAnnotation() instead of hardcoding 8 devices
  • Remove custom SSH port (-p 2222), use default port 22
  • Update rxdm entrypoint args: --num_nics=8 --uid= --alsologtostderr

Motivation

The previous runtime used hostNetwork: true and privileged: true which
caused issues on GKE clusters with extra gVNIC Network CRs. The gVNIC
network interfered with multi-NIC pod injection, causing gpu-nic-7 to be
missing (only 7 of 8 GPU NICs injected). Moving to pod networking with NRI
device injection matches the Excalibur reference configuration and avoids
the gVNIC interference while also improving security posture.

How it works

  1. Worker pods use pod networking (no hostNetwork) with multi-NIC annotation
  2. NRI device injector exposes /dev/nvidia*, /dev/nvidiactl,
    /dev/nvidia-uvm, and /dev/dmabuf_import_helper to tcpxo-daemon
  3. Device list is generated dynamically from GPU_COUNT_PER_NODE
  4. SSH uses default port 22 (no -p 2222 remapping needed without hostNetwork)
  5. NCCL env vars sourced from host nccl-env-profile.sh (unchanged)

Test plan

  • Run NCCL All-Reduce validation on GKE H100 cluster (a3-megagpu-8g)
  • Verify all 8 GPU NICs are injected into worker pods
  • Verify bandwidth ~335 GB/s at 8 GB message size (2 nodes)
  • Verify tcpxo-daemon runs without privileged mode
  • Verify no guest config checker mismatches in launcher logs

@xdu31 xdu31 requested a review from a team as a code owner March 18, 2026 21:15
@xdu31 xdu31 force-pushed the fix/gke-non-hostnetwork-nccl branch from e1dd2ad to c3d9bb5 Compare March 18, 2026 21:18
Contributor

@yuanchen8911 yuanchen8911 left a comment

Review

Good direction — aligning the validator runtime with the NRI profile. A few issues to address:

Critical

  1. Missing /sys and /proc/sys hostPath mounts on tcpxo-daemon — The runtime removes hostNetwork: true but doesn't add the /sys/hostsysfs and /proc/sys/hostprocsysfs volume mounts. Without these, RxDM will detect 0/8 GPUs — the container namespace hides the PCI sysfs tree. These mounts are what makes the NRI profile work without hostNetwork. See demos/workloads/training/gke-nccl-test-tcpxo.yaml in PR #420 for the working config.

  2. RxDM v1.0.21 with incompatible flags — The entrypoint uses --uid= --alsologtostderr but v1.0.21 uses the older fastrak_gpumem_manager binary which reports ERROR: Unknown command line flag 'enforce_kernel_ipv6_support' style errors. We validated v1.0.20 works with simplified args (--num_hops=2 --num_nics 8). Note: v1.0.21 is also marked as deprecated in the container registry. Recommend updating to tcpgpudmarxd-dev:v1.0.20.

Medium

  1. Capability names should omit CAP_ prefix — The runtime uses CAP_NET_ADMIN, CAP_NET_BIND_SERVICE but the Kubernetes spec expects names without the CAP_ prefix (NET_ADMIN, NET_BIND_SERVICE, IPC_LOCK). GKE's containerd accepts both, but it's not portable and could fail on stricter runtimes or admission controllers.

  2. GPU node pool prerequisite — Worth noting in the PR description or runtime comments that the GPU node pool must not have a gVNIC additional network. A gVNIC takes PCI 0000:06:00.0, displacing one of the 8 GPU NICs and causing a 7/8 GPU detection failure. Root cause details in #381.
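The capability fix in point 1 would look like the following sketch (container names and surrounding structure are hypothetical; the capability values are taken from the review):

```yaml
# Sketch: non-privileged security contexts with unprefixed capability names.
containers:
  - name: tcpxo-daemon
    securityContext:
      privileged: false
      capabilities:
        add: ["NET_ADMIN", "NET_BIND_SERVICE"]  # not CAP_NET_ADMIN etc.
  - name: worker
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
```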

Looks good

  • NRI device annotation generation with dynamic GPU count + tests
  • SSH port change from 2222 to default (correct for non-hostNetwork)
  • MPI OOB interface comment update (eth0 is control NIC in pod networking)
  • Worker capability IPC_LOCK instead of privileged

Contributor

@yuanchen8911 yuanchen8911 left a comment

Left some comments

Contributor Author

xdu31 commented Mar 18, 2026

  1. Missing /sys and /proc/sys mounts

These are already present in the runtime (lines 180-183):

  • name: nvtcpxo-sys
    mountPath: /hostsysfs
  • name: nvtcpxo-proc-sys
    mountPath: /hostprocsysfs

Both are backed by hostPath volumes at the bottom of the spec.

  2. RxDM v1.0.21 → v1.0.20

Downgraded to v1.0.20 and removed the --uid= --alsologtostderr flags. Reverted the entrypoint args to --num_hops=2 --num_nics ${GPU_COUNT_PER_NODE}.

  3. Drop CAP_ prefix

Fixed — changed to NET_ADMIN, NET_BIND_SERVICE, IPC_LOCK.

  4. gVNIC prerequisite

Added a comment in the runtime header:

  # Prerequisite: GPU node pool must NOT have a gVNIC additional network.
  # A gVNIC network takes a PCI slot, displacing one GPU NIC (7/8 detection).
  # See https://github.com/NVIDIA/aicr/issues/381 for details.

  5. Lint (CI)

Fixed the prealloc lint — preallocated the lines slice with make([]string, 0, gpuCount+3).
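For reference, the mounts in point 1 would be backed by hostPath volumes along these lines (a sketch; the volume names appear in the runtime, and the /sys and /proc/sys host paths come from the review comment):

```yaml
# Sketch: hostPath volumes exposing the host sysfs tree to tcpxo-daemon,
# so RxDM can detect GPUs without hostNetwork.
volumes:
  - name: nvtcpxo-sys
    hostPath:
      path: /sys
  - name: nvtcpxo-proc-sys
    hostPath:
      path: /proc/sys
```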

@xdu31 xdu31 requested a review from yuanchen8911 March 18, 2026 22:00
Contributor

@yuanchen8911 yuanchen8911 left a comment

/lgtm

@xdu31 xdu31 merged commit d99235e into NVIDIA:main Mar 18, 2026
18 checks passed
xdu31 added a commit to xdu31/aicr that referenced this pull request Mar 24, 2026