refactor(validator): unify GKE NCCL to TrainJob+MPI, match EKS pattern#403
mchmarny merged 4 commits into NVIDIA:main
Conversation
…th TCPXO sidecar pods
mchmarny
left a comment
Good work automating the GKE NCCL validation — clean refactor of the dispatch logic and solid test coverage for the new helpers.
Strengths:
- Clean separation of EKS vs GKE runners behind a shared `parseBandwidthFromLogs`
- Bandwidth regex generalization (last-match strategy) is elegant and handles variable max message sizes well
- Good test coverage: `splitYAMLDocuments`, `peekKind`, the GKE template integration test, and GKE 8G bandwidth parsing
- Proper cleanup with `defer cleanupGKEResources` and `context.Background()` for cleanup (correct pattern)
- Timeout constants in `pkg/defaults` follow project conventions
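The last-match strategy can be sketched as follows. The regex and function name here are illustrative, not the actual `parseBandwidthFromLogs` implementation from the PR; the point is that nccl-tests prints one result row per message size, so keeping the final match picks up the largest message size regardless of how many rows a given `-e MAX_MESSAGE_SIZE` produces.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Illustrative pattern: capture a decimal bandwidth value following "busbw".
// The real validator matches the out-of-place busbw column of nccl-tests output.
var busbwRe = regexp.MustCompile(`busbw\s*[:=]?\s*([0-9]+\.[0-9]+)`)

// parseBandwidthLastMatch returns the bandwidth from the final matching row,
// which corresponds to the largest message size in the sweep.
func parseBandwidthLastMatch(logs string) (float64, error) {
	matches := busbwRe.FindAllStringSubmatch(logs, -1)
	if len(matches) == 0 {
		return 0, fmt.Errorf("no bandwidth values found in logs")
	}
	return strconv.ParseFloat(matches[len(matches)-1][1], 64)
}

func main() {
	logs := "size 8G busbw: 123.45\nsize 16G busbw: 378.90\n"
	bw, err := parseBandwidthLastMatch(logs)
	fmt.Println(bw, err)
}
```

Because the last match wins, the same parser works whether the sweep ends at 8G or 16G.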
Issues to address:
- Resource idempotency (Important): Service and Pod creation will fail on re-run after a partial failure. Use create-or-update semantics or pre-clean stale resources.
- Pod readiness vs Running phase (Important): `WaitForPodRunning` only checks the phase, not container readiness. The TCPXO sidecar must be fully initialized before exec; waiting for the Ready condition would be safer.
- Silent default to EKS runner (Moderate): the `default` switch case routes unknown services through EKS. Explicit cases with an error default would catch missing runner implementations early.
- Comment contradictions (Minor): the regex comment says "out-of-place busbw" but the `parseBandwidthFromLogs` doc comment says "in-place busbw"; these should agree (it's out-of-place).
- Namespace precondition (Minor): the GKE path assumes the namespace exists but doesn't create it.

Items 1-2 are the main blockers; the rest are minor improvements.
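For issue 2, the Ready-condition check could look like the sketch below. The types are pared-down stand-ins for the corev1 ones so the example is self-contained; the real helper would operate on `*corev1.Pod` via client-go and poll until the condition holds or the timeout fires.

```go
package main

import "fmt"

// PodCondition and Pod are simplified stand-ins for the corev1 types.
type PodCondition struct {
	Type   string // e.g. "Ready"
	Status string // "True", "False", or "Unknown"
}

type Pod struct {
	Phase      string // e.g. "Pending", "Running", "Succeeded"
	Conditions []PodCondition
}

// isPodReady reports whether the pod is Running AND has Ready=True, so an
// exec cannot race a TCPXO sidecar that is still initializing. Checking
// phase alone misses that window: a pod is Running as soon as any container
// has started, before all containers pass their readiness checks.
func isPodReady(p Pod) bool {
	if p.Phase != "Running" {
		return false
	}
	for _, c := range p.Conditions {
		if c.Type == "Ready" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	running := Pod{Phase: "Running", Conditions: []PodCondition{{Type: "Ready", Status: "False"}}}
	fmt.Println(isPodReady(running)) // Running, but sidecar not yet Ready
}
```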
…y, WaitForPodReady, explicit service dispatch, namespace ensure
Summary
- Migrated the GKE NCCL validation from a raw Pod flow with `kubectl exec` to the Kubeflow TrainJob + MPI pattern, matching the EKS approach
- Removed `nccl_gke.go` (raw Pod flow), `nccl_gke_test.go`, and the `nccl-test-tcpxo.yaml` template; replaced by per-platform TrainingRuntimes plus a shared TrainJob
- Flow: `applyNCCLResources()` → apply runtime → apply trainjob → `waitForLauncherCompletion()` → parse bandwidth
- Added shared pod helpers (`WaitForPodReady`, `ExecInContainer`) and GKE-specific timeouts

Architecture
Shared TrainJob with per-platform TrainingRuntimes:
The TrainJob is platform-agnostic: it only sets `runtimeRef` and `numNodes`. All platform-specific configuration lives in the per-platform TrainingRuntime. The split is at the runtime level (not TrainJob overrides) because EKS and GKE have fundamentally different GPU networking stacks, and the TrainJob API's `spec.podSpecOverrides` cannot inject native sidecars (`initContainers` with `restartPolicy: Always`), set `hostNetwork: true`, or set a pod-level `dnsPolicy`, all of which GKE TCPXO requires.

EKS vs GKE Runtime Comparison
Both use the Kubeflow Trainer MPI plugin with the same test parameters (`-b 1K -e 16G -f 2 -g 1`).

| | EKS | GKE |
|---|---|---|
| Image | `public.ecr.aws/hpc-cloud/nccl-tests` (sshd pre-installed) | `nvcr.io/nvidia/pytorch:25.06-py3` (`apt-get install openssh-server`) |
| mpirun | `/opt/amazon/openmpi/bin/mpirun` | `/usr/local/mpi/bin/mpirun` |
| Test binary | `/opt/nccl-tests/build/${TEST_TYPE}` | `/usr/local/bin/${TEST_TYPE}_mpi` |
| Network | EFA (`FI_PROVIDER=efa`, `FI_EFA_USE_DEVICE_RDMA=1`, 32 EFA devices) | TCPXO (`NCCL_FASTRAK_*` env vars, tcpxo-daemon sidecar) |
| NCCL tuning | `NCCL_NET_GDR_LEVEL=PIX` | `NCCL_IGNORE_DISABLED_P2P=1`, `NCCL_P2P_NVL_CHUNKSIZE`, `NCCL_P2P_NET_CHUNKSIZE`, `NCCL_P2P_PCI_CHUNKSIZE`, `NCCL_NVLSTREE_MAX_CHUNKSIZE` |
| Sidecar | none | tcpxo-daemon (`restartPolicy: Always`) |
| Privileges | `capabilities: [IPC_LOCK]` | `privileged: true` (PCI BAR mmap for FastRak) |
| Node selector | `node.kubernetes.io/instance-type: p5.48xlarge` | `cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb` |
| MPI args | `plm_rsh_agent ssh` | `oob_tcp_if_include eth0,eth1`, `btl_tcp_if_include eth0,eth1`, `UCX_NET_DEVICES=eth1` |
| Annotations | n/a | `networking.gke.io/interfaces` (8 GPU NICs, discovered at runtime) |

Config alignment effort
GKE TCPXO required extensive debugging to match EKS quality:
- `hostNetwork: true` binds the host's sshd on port 22, forcing an alternate SSH port
- NIC layout differs from expectations (host NIC: `eth1` not `eth0`; GPU NICs: `eth2`-`eth9` not `eth1`-`eth8`), discovered via `nccl-env-profile.sh`
- Env vars (`NCCL_ALGO`, `NCCL_DYNAMIC_CHUNK_SIZE`, `NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY`, `NCCL_NVLS_ENABLE`) flagged as "expected unset"
- `/sys/bus/pci/devices/.../resource0_wc` requires `privileged: true` (not just `IPC_LOCK`)
- EKS uses `all_reduce_perf`, GKE uses `all_reduce_perf_mpi`; both are MPI-linked, just different image conventions. Both now use the `${TEST_TYPE}` template variable
- Test args: `-b ${MIN_MESSAGE_SIZE} -e ${MAX_MESSAGE_SIZE} -f 2 -g 1`

Bandwidth results
GPU hardware is the same on both (8x H100 80GB HBM3). The ~30% gap is explained by 2x network fabric difference (3.2 vs 1.6 Tbps). On 2 nodes the gap is narrower since most all-reduce traffic stays intra-node over NVLink (identical on both) — the fabric difference matters more as node count scales.
Changes
- `nccl_all_reduce_bw_constraint.go`: route GKE through the `runNCCLTrainJob` path; add `discoverGKEGPUNICNetworks()` and `buildGKENetworkInterfacesAnnotation()`; add `waitForTrainingRuntime()`
- `testdata/h100/gke/runtime.yaml`: new GKE TrainingRuntime
- `testdata/trainjob.yaml`: moved out of `h100/eks/`; shared TrainJob (just runtimeRef + numNodes)
- `testdata/h100/eks/runtime.yaml`: EKS TrainingRuntime
- Deleted: `nccl_gke.go`, `nccl_gke_test.go`, `testdata/h100/gke/nccl-test-tcpxo.yaml`
- `validators/helper/pod.go`: add `WaitForPodReady`, `checkPodReadyOrTerminal`, `ExecInContainer`
- `pkg/defaults/timeouts.go`: add `NCCLGKEPodReadyTimeout`, `NCCLGKEExecTimeout`

Test plan
- `go test -race ./validators/performance/...`: all pass
- `go test -race ./validators/helper/...`: all pass
- `golangci-lint run`: 0 issues
- `make tidy`: vendor/ synced (removed unused SPDY/websocket/remotecommand deps)
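For illustration, building the `networking.gke.io/interfaces` annotation from discovered GPU NIC networks might look like the sketch below. The function is a hypothetical stand-in for the PR's `buildGKENetworkInterfacesAnnotation()`: the JSON shape and network names are assumptions based on GKE multi-network conventions, and the `eth2`-`eth9` mapping follows the NIC layout found during debugging.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gkeInterface is the assumed entry shape for the
// networking.gke.io/interfaces pod annotation.
type gkeInterface struct {
	InterfaceName string `json:"interfaceName"`
	Network       string `json:"network"`
}

// buildInterfacesAnnotation is a hypothetical stand-in for the PR's helper:
// it keeps eth0 on the default pod network and maps each discovered GPU NIC
// network onto eth2, eth3, ... in order.
func buildInterfacesAnnotation(gpuNetworks []string) (string, error) {
	ifaces := []gkeInterface{{InterfaceName: "eth0", Network: "default"}}
	for i, network := range gpuNetworks {
		ifaces = append(ifaces, gkeInterface{
			InterfaceName: fmt.Sprintf("eth%d", i+2), // GPU NICs start at eth2
			Network:       network,
		})
	}
	b, err := json.Marshal(ifaces)
	return string(b), err
}

func main() {
	ann, err := buildInterfacesAnnotation([]string{"gpu-net-1", "gpu-net-2"})
	fmt.Println(ann, err)
}
```

In the real flow, `discoverGKEGPUNICNetworks()` would supply the network list at runtime and the resulting string would be set as a pod annotation on the TrainingRuntime's pod template.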