refactor(validator): unify GKE NCCL to TrainJob+MPI, match EKS pattern#403
mchmarny merged 4 commits into NVIDIA:main
Conversation
…th TCPXO sidecar pods
mchmarny
left a comment
Good work automating the GKE NCCL validation — clean refactor of the dispatch logic and solid test coverage for the new helpers.
Strengths:
- Clean separation of EKS vs GKE runners behind a shared `parseBandwidthFromLogs`
- Bandwidth regex generalization (last-match strategy) is elegant and handles variable max message sizes well
- Good test coverage: `splitYAMLDocuments`, `peekKind`, the GKE template integration test, and GKE 8G bandwidth parsing
- Proper cleanup with `defer cleanupGKEResources` and `context.Background()` for cleanup (correct pattern)
- Timeout constants in `pkg/defaults` follow project conventions
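The last-match strategy can be sketched as follows. The regex and function name here are illustrative, not the actual `parseBandwidthFromLogs` implementation from the PR; the point is that nccl-tests prints one result row per message size, so keeping the final match picks up the largest message size regardless of how many rows a given `-e MAX_MESSAGE_SIZE` produces.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Illustrative pattern: capture a decimal bandwidth value following "busbw".
// The real validator matches the out-of-place busbw column of nccl-tests output.
var busbwRe = regexp.MustCompile(`busbw\s*[:=]?\s*([0-9]+\.[0-9]+)`)

// parseBandwidthLastMatch returns the bandwidth from the final matching row,
// which corresponds to the largest message size in the sweep.
func parseBandwidthLastMatch(logs string) (float64, error) {
	matches := busbwRe.FindAllStringSubmatch(logs, -1)
	if len(matches) == 0 {
		return 0, fmt.Errorf("no bandwidth values found in logs")
	}
	return strconv.ParseFloat(matches[len(matches)-1][1], 64)
}

func main() {
	logs := "size 8G busbw: 123.45\nsize 16G busbw: 378.90\n"
	bw, err := parseBandwidthLastMatch(logs)
	fmt.Println(bw, err)
}
```

Because the last match wins, the same parser works whether the sweep ends at 8G or 16G.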
Issues to address:
- Resource idempotency (Important): Service and Pod creation will fail on re-run after a partial failure. Use create-or-update semantics or pre-clean stale resources.
- Pod readiness vs Running phase (Important): `WaitForPodRunning` only checks the phase, not container readiness. The TCPXO sidecar must be fully initialized before exec; waiting for the Ready condition would be safer.
- Silent default to EKS runner (Moderate): the `default` switch case routes unknown services through EKS. Explicit cases with an error default would catch missing runner implementations early.
- Comment contradictions (Minor): the regex comment says "out-of-place busbw" but the `parseBandwidthFromLogs` doc comment says "in-place busbw"; these should agree (it's out-of-place).
- Namespace precondition (Minor): the GKE path assumes the namespace exists but doesn't create it.

Items 1-2 are the main blockers; the rest are minor improvements.
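For issue 2, the Ready-condition check could look like the sketch below. The types are pared-down stand-ins for the corev1 ones so the example is self-contained; the real helper would operate on `*corev1.Pod` via client-go and poll until the condition holds or the timeout fires.

```go
package main

import "fmt"

// PodCondition and Pod are simplified stand-ins for the corev1 types.
type PodCondition struct {
	Type   string // e.g. "Ready"
	Status string // "True", "False", or "Unknown"
}

type Pod struct {
	Phase      string // e.g. "Pending", "Running", "Succeeded"
	Conditions []PodCondition
}

// isPodReady reports whether the pod is Running AND has Ready=True, so an
// exec cannot race a TCPXO sidecar that is still initializing. Checking
// phase alone misses that window: a pod is Running as soon as any container
// has started, before all containers pass their readiness checks.
func isPodReady(p Pod) bool {
	if p.Phase != "Running" {
		return false
	}
	for _, c := range p.Conditions {
		if c.Type == "Ready" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	running := Pod{Phase: "Running", Conditions: []PodCondition{{Type: "Ready", Status: "False"}}}
	fmt.Println(isPodReady(running)) // Running, but sidecar not yet Ready
}
```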
…y, WaitForPodReady, explicit service dispatch, namespace ensure
Summary
- Migrated the GKE NCCL validation from a raw Pod flow with `kubectl exec` to the Kubeflow TrainJob + MPI pattern, matching the EKS approach
- Removed `nccl_gke.go` (raw Pod flow), `nccl_gke_test.go`, and the `nccl-test-tcpxo.yaml` template; replaced by per-platform TrainingRuntimes plus a shared TrainJob
- Flow: `applyNCCLResources()` → apply runtime → apply trainjob → `waitForLauncherCompletion()` → parse bandwidth
- Added shared pod helpers (`WaitForPodReady`, `ExecInContainer`) and GKE-specific timeouts

Architecture
Shared TrainJob with per-platform TrainingRuntimes:
The TrainJob is platform-agnostic: it only sets `runtimeRef` and `numNodes`. All platform-specific configuration lives in the per-platform TrainingRuntime. The split is at the runtime level (not TrainJob overrides) because EKS and GKE have fundamentally different GPU networking stacks, and the TrainJob API's `spec.podSpecOverrides` cannot inject native sidecars (`initContainers` with `restartPolicy: Always`), set `hostNetwork: true`, or set a pod-level `dnsPolicy`, all of which GKE TCPXO requires.

EKS vs GKE Runtime Comparison
Both use the Kubeflow Trainer MPI plugin with the same test parameters (`-b 1K -e 16G -f 2 -g 1`).

| | EKS | GKE |
|---|---|---|
| Image | `public.ecr.aws/hpc-cloud/nccl-tests` (sshd pre-installed) | `nvcr.io/nvidia/pytorch:25.06-py3` (`apt-get install openssh-server`) |
| mpirun | `/opt/amazon/openmpi/bin/mpirun` | `/usr/local/mpi/bin/mpirun` |
| Test binary | `/opt/nccl-tests/build/${TEST_TYPE}` | `/usr/local/bin/${TEST_TYPE}_mpi` |
| Network | EFA (`FI_PROVIDER=efa`, `FI_EFA_USE_DEVICE_RDMA=1`, 32 EFA devices) | TCPXO (`NCCL_FASTRAK_*` env vars, tcpxo-daemon sidecar) |
| NCCL tuning | `NCCL_NET_GDR_LEVEL=PIX` | `NCCL_IGNORE_DISABLED_P2P=1`, `NCCL_P2P_NVL_CHUNKSIZE`, `NCCL_P2P_NET_CHUNKSIZE`, `NCCL_P2P_PCI_CHUNKSIZE`, `NCCL_NVLSTREE_MAX_CHUNKSIZE` |
| Sidecar | none | tcpxo-daemon (`restartPolicy: Always`) |
| Privileges | `capabilities: [IPC_LOCK]` | `privileged: true` (PCI BAR mmap for FastRak) |
| Node selector | `node.kubernetes.io/instance-type: p5.48xlarge` | `cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb` |
| MPI args | `plm_rsh_agent ssh` | `oob_tcp_if_include eth0,eth1`, `btl_tcp_if_include eth0,eth1`, `UCX_NET_DEVICES=eth1` |
| Annotations | n/a | `networking.gke.io/interfaces` (8 GPU NICs, discovered at runtime) |

Config alignment effort
GKE TCPXO required extensive debugging to match EKS quality:
- `hostNetwork: true` binds the host's sshd on port 22, forcing an alternate SSH port
- NIC layout differs from expectations (host NIC: `eth1` not `eth0`; GPU NICs: `eth2`-`eth9` not `eth1`-`eth8`), discovered via `nccl-env-profile.sh`
- Env vars (`NCCL_ALGO`, `NCCL_DYNAMIC_CHUNK_SIZE`, `NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY`, `NCCL_NVLS_ENABLE`) flagged as "expected unset"
- `/sys/bus/pci/devices/.../resource0_wc` requires `privileged: true` (not just `IPC_LOCK`)
- EKS uses `all_reduce_perf`, GKE uses `all_reduce_perf_mpi`; both are MPI-linked, just different image conventions. Both now use the `${TEST_TYPE}` template variable
- Test args: `-b ${MIN_MESSAGE_SIZE} -e ${MAX_MESSAGE_SIZE} -f 2 -g 1`

Bandwidth results
GPU hardware is the same on both (8x H100 80GB HBM3). The ~30% gap is explained by 2x network fabric difference (3.2 vs 1.6 Tbps). On 2 nodes the gap is narrower since most all-reduce traffic stays intra-node over NVLink (identical on both) — the fabric difference matters more as node count scales.
Changes
- `nccl_all_reduce_bw_constraint.go`: route GKE through the `runNCCLTrainJob` path; add `discoverGKEGPUNICNetworks()` and `buildGKENetworkInterfacesAnnotation()`; add `waitForTrainingRuntime()`
- `testdata/h100/gke/runtime.yaml`: new GKE TrainingRuntime
- `testdata/trainjob.yaml`: moved out of `h100/eks/`; shared TrainJob (just runtimeRef + numNodes)
- `testdata/h100/eks/runtime.yaml`: EKS TrainingRuntime
- Deleted: `nccl_gke.go`, `nccl_gke_test.go`, `testdata/h100/gke/nccl-test-tcpxo.yaml`
- `validators/helper/pod.go`: add `WaitForPodReady`, `checkPodReadyOrTerminal`, `ExecInContainer`
- `pkg/defaults/timeouts.go`: add `NCCLGKEPodReadyTimeout`, `NCCLGKEExecTimeout`

Test plan
- `go test -race ./validators/performance/...`: all pass
- `go test -race ./validators/helper/...`: all pass
- `golangci-lint run`: 0 issues
- `make tidy`: vendor/ synced (removed unused SPDY/websocket/remotecommand deps)
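For illustration, building the `networking.gke.io/interfaces` annotation from discovered GPU NIC networks might look like the sketch below. The function is a hypothetical stand-in for the PR's `buildGKENetworkInterfacesAnnotation()`: the JSON shape and network names are assumptions based on GKE multi-network conventions, and the `eth2`-`eth9` mapping follows the NIC layout found during debugging.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gkeInterface is the assumed entry shape for the
// networking.gke.io/interfaces pod annotation.
type gkeInterface struct {
	InterfaceName string `json:"interfaceName"`
	Network       string `json:"network"`
}

// buildInterfacesAnnotation is a hypothetical stand-in for the PR's helper:
// it keeps eth0 on the default pod network and maps each discovered GPU NIC
// network onto eth2, eth3, ... in order.
func buildInterfacesAnnotation(gpuNetworks []string) (string, error) {
	ifaces := []gkeInterface{{InterfaceName: "eth0", Network: "default"}}
	for i, network := range gpuNetworks {
		ifaces = append(ifaces, gkeInterface{
			InterfaceName: fmt.Sprintf("eth%d", i+2), // GPU NICs start at eth2
			Network:       network,
		})
	}
	b, err := json.Marshal(ifaces)
	return string(b), err
}

func main() {
	ann, err := buildInterfacesAnnotation([]string{"gpu-net-1", "gpu-net-2"})
	fmt.Println(ann, err)
}
```

In the real flow, `discoverGKEGPUNICNetworks()` would supply the network list at runtime and the resulting string would be set as a pod annotation on the TrainingRuntime's pod template.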