
refactor(validator): unify GKE NCCL to TrainJob+MPI, match EKS pattern#403

Merged
mchmarny merged 4 commits into NVIDIA:main from xdu31:feat/gke-nccl
Mar 16, 2026
Conversation

Contributor

@xdu31 xdu31 commented Mar 13, 2026

Summary

  • Rework GKE NCCL all-reduce bandwidth validation from raw Pods + kubectl exec to Kubeflow TrainJob + MPI pattern, matching the EKS approach
  • Delete nccl_gke.go (raw Pod flow), nccl_gke_test.go, and nccl-test-tcpxo.yaml template — replaced by per-platform TrainingRuntime + shared TrainJob
  • Both platforms now share the same code path: applyNCCLResources() → apply runtime → apply trainjob → waitForLauncherCompletion() → parse bandwidth
  • Remove GKE-only helper methods (WaitForPodReady, ExecInContainer) and GKE-specific timeouts
  • Add documentation for testing with custom validator images and private registry authentication

Architecture

Shared TrainJob with per-platform TrainingRuntimes:

testdata/
├── trainjob.yaml              ← shared (runtimeRef + numNodes only)
└── h100/
    ├── eks/
    │   └── runtime.yaml       ← EFA image, EFA mpirun args, p5 nodeSelector
    └── gke/
        └── runtime.yaml       ← TCPXO sidecar, FastRak args, hostNetwork, a3-megagpu nodeSelector

The TrainJob is platform-agnostic — it only sets runtimeRef and numNodes. All platform-specific configuration lives in the per-platform TrainingRuntime. The per-platform split is at the runtime level (not TrainJob overrides) because EKS and GKE have fundamentally different GPU networking stacks. The TrainJob API's spec.podSpecOverrides cannot inject native sidecars (initContainers with restartPolicy: Always), set hostNetwork: true, or set pod-level dnsPolicy — all required for GKE TCPXO.
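A minimal sketch of what the shared trainjob.yaml carries (field names follow the Kubeflow Trainer v2 API; the object and runtime names here are illustrative, not the actual manifest):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: nccl-all-reduce        # illustrative name
spec:
  runtimeRef:
    name: nccl-runtime         # resolves to the per-platform TrainingRuntime
  trainer:
    numNodes: 2                # the only tunable the shared job sets
```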

EKS vs GKE Runtime Comparison

Both use Kubeflow Trainer MPI plugin with the same test parameters (-b 1K -e 16G -f 2 -g 1).

|                      | EKS (p5.48xlarge) | GKE (a3-megagpu-8g) |
|----------------------|-------------------|---------------------|
| Image                | public.ecr.aws/hpc-cloud/nccl-tests (sshd pre-installed) | nvcr.io/nvidia/pytorch:25.06-py3 (apt-get install openssh-server) |
| mpirun               | /opt/amazon/openmpi/bin/mpirun | /usr/local/mpi/bin/mpirun |
| Test binary          | /opt/nccl-tests/build/${TEST_TYPE} | /usr/local/bin/${TEST_TYPE}_mpi |
| Network transport    | EFA (FI_PROVIDER=efa, FI_EFA_USE_DEVICE_RDMA=1, 32 EFA devices) | TCPXO FastRak (18 NCCL_FASTRAK_* env vars, tcpxo-daemon sidecar) |
| Network bandwidth    | 3.2 Tbps (32 x 100 Gbps EFA) | 1.6 Tbps (8 x 200 Gbps TCPXO) |
| GPUDirect RDMA       | Via EFA | NCCL_NET_GDR_LEVEL=PIX |
| NVLink P2P tuning    | NCCL_IGNORE_DISABLED_P2P=1 | NCCL_P2P_NVL_CHUNKSIZE, NCCL_P2P_NET_CHUNKSIZE, NCCL_P2P_PCI_CHUNKSIZE, NCCL_NVLSTREE_MAX_CHUNKSIZE |
| SSH port             | 22 (default) | 2222 (hostNetwork conflicts with host sshd) |
| hostNetwork          | No | Yes (GKE upstream #580: PCI sysfs visibility) |
| Sidecar              | None | tcpxo-daemon native sidecar (restartPolicy: Always) |
| securityContext      | capabilities: [IPC_LOCK] | privileged: true (PCI BAR mmap for FastRak) |
| Node selector        | node.kubernetes.io/instance-type: p5.48xlarge | cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb |
| MCA/UCX params       | plm_rsh_agent ssh | oob_tcp_if_include eth0,eth1, btl_tcp_if_include eth0,eth1, UCX_NET_DEVICES=eth1 |
| Multi-NIC annotation | N/A | networking.gke.io/interfaces (8 GPU NICs, discovered at runtime) |
| Extra volumes        | None | 4 hostPath (nvidia libs, sys, proc/sys, aperture_devices) |
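The GKE sidecar and hostNetwork rows correspond to the Kubernetes native-sidecar pattern: an initContainer with restartPolicy: Always runs for the whole Pod lifetime (available since Kubernetes 1.29). A hedged fragment of what the GKE runtime's pod template might contain (image and container names are placeholders, not the actual manifest):

```yaml
# Illustrative pod-template fragment for the GKE TrainingRuntime.
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  initContainers:
    - name: tcpxo-daemon
      image: example.com/tcpxo-daemon:latest  # placeholder image
      restartPolicy: Always                   # native sidecar: runs for the Pod's lifetime
      securityContext:
        privileged: true                      # PCI BAR mmap for FastRak
```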

Config alignment effort

GKE TCPXO required extensive debugging to match EKS quality:

  • SSH port 2222: hostNetwork: true binds host's sshd on port 22, forcing alternate port
  • MCA/UCX NIC restriction: With hostNetwork, UCX discovers 169.254.x.x GPU NIC link-local addresses that aren't routable cross-node — restricted to control NIC
  • NIC name shift: hostNetwork on a3-megagpu-8g shifts all interfaces by 1 (control: eth1 not eth0, GPU NICs: eth2-eth9 not eth1-eth8) — discovered via nccl-env-profile.sh
  • Guest Config Checker compliance: Removed 4 stale env vars (NCCL_ALGO, NCCL_DYNAMIC_CHUNK_SIZE, NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY, NCCL_NVLS_ENABLE) flagged as "expected unset"
  • PCI BAR access: FastRak plugin needs /sys/bus/pci/devices/.../resource0_wc — requires privileged: true (not just IPC_LOCK)
  • Test binary: EKS uses all_reduce_perf, GKE uses all_reduce_perf_mpi — both MPI-linked, different image conventions. Both now use ${TEST_TYPE} template variable
  • Matched test sweep: Both use identical -b ${MIN_MESSAGE_SIZE} -e ${MAX_MESSAGE_SIZE} -f 2 -g 1

Bandwidth results

  • EKS (p5.48xlarge, 2 nodes, 3.2 Tbps EFA): 485 GB/s busBW
  • GKE (a3-megagpu-8g, 2 nodes, 1.6 Tbps TCPXO): 335 GB/s busBW

GPU hardware is the same on both (8x H100 80GB HBM3). The ~30% gap is explained by 2x network fabric difference (3.2 vs 1.6 Tbps). On 2 nodes the gap is narrower since most all-reduce traffic stays intra-node over NVLink (identical on both) — the fabric difference matters more as node count scales.

Changes

| File | Change |
|------|--------|
| nccl_all_reduce_bw_constraint.go | Remove service dispatch, unify to single runNCCLTrainJob path, add discoverGKEGPUNICNetworks() and buildGKENetworkInterfacesAnnotation(), add waitForTrainingRuntime() |
| testdata/h100/gke/runtime.yaml | New: GKE TrainingRuntime with TCPXO sidecar, FastRak env vars, hostNetwork, multi-NIC annotations |
| testdata/trainjob.yaml | Moved from h100/eks/: shared TrainJob (just runtimeRef + numNodes) |
| testdata/h100/eks/runtime.yaml | Comment updates (header, sync note) |
| nccl_gke.go | Deleted: raw Pod approach (294 lines) |
| nccl_gke_test.go | Deleted: tests for deleted code (151 lines) |
| testdata/h100/gke/nccl-test-tcpxo.yaml | Deleted: raw Pod template (221 lines) |
| validators/helper/pod.go | Remove WaitForPodReady, checkPodReadyOrTerminal, ExecInContainer |
| pkg/defaults/timeouts.go | Remove NCCLGKEPodReadyTimeout, NCCLGKEExecTimeout |

Test plan

  • go test -race ./validators/performance/... — all pass
  • go test -race ./validators/helper/... — all pass
  • golangci-lint run — 0 issues
  • make tidy — vendor/ synced (removed unused SPDY/websocket/remotecommand deps)
  • E2E on EKS H100 cluster (p5.48xlarge, 2 nodes): 485 GB/s busBW — PASS
  • E2E on GKE H100 cluster (a3-megagpu-8g, 2 nodes): 335 GB/s busBW — PASS

@xdu31 xdu31 requested a review from a team as a code owner March 13, 2026 20:53
Member

@mchmarny mchmarny left a comment

Good work automating the GKE NCCL validation — clean refactor of the dispatch logic and solid test coverage for the new helpers.

Strengths:

  • Clean separation of EKS vs GKE runners behind a shared parseBandwidthFromLogs
  • Bandwidth regex generalization (last-match strategy) is elegant and handles variable max message sizes well
  • Good test coverage: splitYAMLDocuments, peekKind, GKE template integration test, and GKE 8G bandwidth parsing
  • Proper cleanup with defer cleanupGKEResources and context.Background() for cleanup (correct pattern)
  • Timeout constants in pkg/defaults follow project conventions

Issues to address:

  1. Resource idempotency (Important): Service and Pod creation will fail on re-run after partial failure. Use create-or-update semantics or pre-cleanup stale resources.

  2. Pod readiness vs Running phase (Important): WaitForPodRunning only checks phase, not container readiness. The TCPXO sidecar must be fully initialized before exec — waiting for Ready condition would be safer.

  3. Silent default to EKS runner (Moderate): The default switch case routes unknown services through EKS. Explicit cases with an error default would catch missing runner implementations early.

  4. Comment contradictions (Minor): The regex comment says "out-of-place busbw" but parseBandwidthFromLogs doc says "in-place busbw" — these should agree (it's out-of-place).

  5. Namespace precondition (Minor): GKE path assumes namespace exists but doesn't create it.

Items 1-2 are the main blockers — the rest are minor improvements.
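The last-match regex strategy noted in the strengths can be sketched as follows (the log excerpt and regex are illustrative, not the validator's actual code; nccl-tests prints one row per message size, with the out-of-place busbw as the final numeric column):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Invented excerpt in nccl-tests row format: size(B) ... algbw busbw.
const logs = `
      1024   ...    0.05     0.09
   1048576   ...  120.10   210.20
17179869184  ...  270.00   485.00
`

// lastBusBW keeps only the final match, i.e. the busbw of the largest
// message size, so the parser works for any sweep length.
func lastBusBW(logs string) (float64, error) {
	// (?m) makes $ anchor at each line end, matching the last column per row.
	re := regexp.MustCompile(`(?m)([0-9]+\.[0-9]+)\s*$`)
	matches := re.FindAllStringSubmatch(logs, -1)
	if len(matches) == 0 {
		return 0, fmt.Errorf("no bandwidth values in logs")
	}
	return strconv.ParseFloat(matches[len(matches)-1][1], 64)
}

func main() {
	bw, err := lastBusBW(logs)
	if err != nil {
		panic(err)
	}
	fmt.Printf("busBW: %.2f GB/s\n", bw) // prints "busBW: 485.00 GB/s"
}
```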

@mchmarny mchmarny added the enhancement (New feature or request) and area/validator labels Mar 13, 2026
@mchmarny mchmarny added this to the M2 - KubeCon EU milestone Mar 13, 2026
…y, WaitForPodReady, explicit service dispatch, namespace ensure
@xdu31 xdu31 changed the title feat(validator): automate GKE NCCL all-reduce bandwidth validation with TCPXO sidecar pods refactor(validator): unify GKE NCCL to TrainJob+MPI, match EKS pattern Mar 14, 2026
@mchmarny mchmarny merged commit 06c7428 into NVIDIA:main Mar 16, 2026
21 checks passed

Labels

enhancement (New feature or request), size/XL
