### Problem

The NCCL performance validator (`nccl-all-reduce-bw`) only supports EKS via a Kubeflow TrainJob. GKE requires a different execution model because GPUDirect TCPXO needs a `tcpxo-daemon` sidecar per pod and `hostNetwork: true`, which doesn't fit the TrainJob abstraction.

GKE+H100 is currently in `pendingNCCLCombinations` and skips with an informative message. Testdata exists at `validators/performance/testdata/h100/gke/` but cannot be used by the current automation.
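For illustration, the GKE execution model roughly implies a pod shape like the following. This is a sketch only: the container names, images, and fields are assumptions, not the actual contents of `testdata/h100/gke/runtime.yaml`.

```yaml
# Illustrative only — names and images are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: host-1
spec:
  hostNetwork: true              # required by GPUDirect TCPXO
  containers:
    - name: nccl-test            # runs the all-reduce benchmark
      image: example/nccl-test:latest      # placeholder image
      securityContext:
        privileged: true         # fallback profile noted under Validation
      resources:
        limits:
          nvidia.com/gpu: 8
    - name: tcpxo-daemon         # per-pod sidecar TCPXO requires
      image: example/tcpxo-daemon:latest   # placeholder image
```

Because each pod carries the sidecar and host networking, the workload is raw Pods + Services rather than anything TrainJob can model.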
### Root Cause

The validator's apply flow is hardcoded to single-GVR resources:
- Apply `runtime.yaml` as a `TrainingRuntime`
- Apply `trainjob.yaml` as a `TrainJob`
- Wait for TrainJob completion
- Extract logs from the launcher pod

GKE's `runtime.yaml` contains multiple resource types (Services + Pods), causing: `the API version in the data (v1) does not match the expected API version (trainer.kubeflow.org/v1alpha1)`
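The root cause can be seen with a minimal sketch of the missing step: splitting a multi-document manifest and detecting each document's `apiVersion`/`kind` before choosing a GVR. `splitDocs` is a hypothetical helper, not code from the repo; a real implementation would decode with apimachinery and resolve GVRs via a RESTMapper, but the shape is:

```go
package main

import (
	"fmt"
	"strings"
)

// splitDocs is a hypothetical helper: it splits a multi-document YAML
// manifest on "---" separators and extracts each document's
// apiVersion/kind pair, so the caller can resolve a GVR per document
// instead of assuming one resource type for the whole file.
func splitDocs(manifest string) [][2]string {
	var out [][2]string
	for _, doc := range strings.Split(manifest, "\n---") {
		doc = strings.TrimSpace(doc)
		if doc == "" {
			continue
		}
		var apiVersion, kind string
		for _, line := range strings.Split(doc, "\n") {
			if v, ok := strings.CutPrefix(line, "apiVersion:"); ok {
				apiVersion = strings.TrimSpace(v)
			} else if v, ok := strings.CutPrefix(line, "kind:"); ok {
				kind = strings.TrimSpace(v)
			}
		}
		out = append(out, [2]string{apiVersion, kind})
	}
	return out
}

func main() {
	manifest := `apiVersion: v1
kind: Service
metadata:
  name: host-1
---
apiVersion: v1
kind: Pod
metadata:
  name: host-1
`
	for _, d := range splitDocs(manifest) {
		fmt.Println(d[0], d[1]) // v1 Service, then v1 Pod
	}
}
```

The current flow instead feeds the whole file to the TrainingRuntime GVR, which is why a plain `v1` Service trips the API-version check.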
### Proposed Fix

Add a GKE execution strategy in `validateNcclAllReduceBw`:
- Multi-resource apply — split the YAML on `---`, detect each resource's GVR from `apiVersion`/`kind`, apply each independently
- Pod readiness wait — wait for the NCCL test pods to be 2/2 Ready (not TrainJob completion)
- Exec-based trigger — `kubectl exec` into host-1 to run `/scripts/allreduce.sh`
- Parse output — reuse the existing `ncclBandwidthRe` regex against the exec stdout
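The parse step could look like the sketch below. The pattern is an assumption about the nccl-tests summary line (`# Avg bus bandwidth : <value>`); the repo's actual `ncclBandwidthRe` may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// avgBusBWRe approximates the nccl-tests summary line; the real
// ncclBandwidthRe in the repo may use a different pattern.
var avgBusBWRe = regexp.MustCompile(`# Avg bus bandwidth\s*:\s*([0-9.]+)`)

// parseAvgBusBW scans exec stdout for the average bus bandwidth (GB/s).
func parseAvgBusBW(stdout string) (float64, bool) {
	m := avgBusBWRe.FindStringSubmatch(stdout)
	if m == nil {
		return 0, false
	}
	v, err := strconv.ParseFloat(m[1], 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	out := "# Out of bounds values : 0 OK\n# Avg bus bandwidth    : 87.2\n"
	bw, ok := parseAvgBusBW(out)
	fmt.Println(bw, ok) // prints "87.2 true"
}
```

The only change from the TrainJob path is the input source: stdout captured from `kubectl exec` instead of launcher pod logs.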
Branch on `service == GKE` to use this flow; EKS continues to use the TrainJob path.
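The branch itself is trivial; a hypothetical sketch (identifiers are illustrative, not the actual code in `nccl_all_reduce_bw_constraint.go`):

```go
package main

import "fmt"

// pickFlow is a hypothetical stand-in for the proposed branch inside
// validateNcclAllReduceBw: GKE gets the new exec-based strategy, every
// other service keeps the existing TrainJob path.
func pickFlow(service string) string {
	if service == "GKE" {
		// multi-resource apply + pod readiness wait + exec trigger
		return "gke-exec"
	}
	// existing TrainingRuntime/TrainJob path (EKS)
	return "trainjob"
}

func main() {
	fmt.Println(pickFlow("GKE"), pickFlow("EKS")) // prints "gke-exec trainjob"
}
```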
### Files

- `validators/performance/nccl_all_reduce_bw_constraint.go` — add GKE execution branch
- `validators/performance/testdata/h100/gke/runtime.yaml` — already exists (raw Pods + Services)
- `validators/performance/testdata/h100/gke/trainjob.yaml` — may be replaced by the exec trigger logic
### Validation

Manually validated on GKE a3-megagpu-8g (2x H100, COS, K8s 1.35):
- NCCL AllReduce: 335 GB/s peak busBW, 87.2 GB/s avg
- Using `hostNetwork: true` + `privileged: true` (fallback profile)
Related: #381 (TCPXO hostNetwork requirement), upstream container-engine-accelerators#580