
feat(validator): automate GKE NCCL performance validation with raw Pods + exec #387

@yuanchen8911

Description

Problem

The NCCL performance validator (nccl-all-reduce-bw) only supports EKS via Kubeflow TrainJob. GKE requires a different execution model because GPUDirect TCPXO needs a tcpxo-daemon sidecar per pod and hostNetwork: true, which doesn't fit the TrainJob abstraction.

GKE+H100 is currently in pendingNCCLCombinations and skips with an informative message. Testdata exists at validators/performance/testdata/h100/gke/ but cannot be used by the current automation.

Root Cause

The validator's apply flow is hardcoded to manifests that each contain a single resource of a known GVR:

  1. Apply runtime.yaml as a TrainingRuntime
  2. Apply trainjob.yaml as a TrainJob
  3. Wait for TrainJob completion
  4. Extract logs from launcher pod

GKE's runtime.yaml contains multiple resource types (Services + Pods), so decoding it as a single TrainingRuntime fails with: "the API version in the data (v1) does not match the expected API version (trainer.kubeflow.org/v1alpha1)"
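The failure mode is visible by inspecting each document of the multi-resource manifest on its own. A minimal, hedged sketch of per-document GVR detection — splitting on `---` and reading apiVersion/kind with naive stdlib-only line scanning (the real validator would use a proper YAML decoder from k8s.io/apimachinery):

```go
package main

import (
	"fmt"
	"strings"
)

// docGVK reports the apiVersion and kind of a single YAML document using
// naive line scanning; a real implementation would use a YAML decoder.
func docGVK(doc string) (apiVersion, kind string) {
	for _, line := range strings.Split(doc, "\n") {
		line = strings.TrimSpace(line)
		if v, ok := strings.CutPrefix(line, "apiVersion:"); ok {
			apiVersion = strings.TrimSpace(v)
		} else if v, ok := strings.CutPrefix(line, "kind:"); ok {
			kind = strings.TrimSpace(v)
		}
	}
	return apiVersion, kind
}

func main() {
	// Shape of GKE's runtime.yaml: Services and Pods in one file.
	manifest := `apiVersion: v1
kind: Service
metadata:
  name: host-1
---
apiVersion: v1
kind: Pod
metadata:
  name: host-1
`
	for _, doc := range strings.Split(manifest, "\n---\n") {
		av, k := docGVK(doc)
		fmt.Printf("%s %s\n", av, k)
	}
}
```

Every document here is core/v1, so force-decoding the file as trainer.kubeflow.org/v1alpha1 TrainingRuntime can only fail.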

Proposed Fix

Add a GKE execution strategy in validateNcclAllReduceBw:

  1. Multi-resource apply — split YAML by ---, detect each resource's GVR from apiVersion/kind, apply independently
  2. Pod readiness wait — wait for NCCL test pods to be 2/2 Ready (not TrainJob completion)
  3. Exec-based trigger — kubectl exec into host-1 to run /scripts/allreduce.sh
  4. Parse output — reuse existing ncclBandwidthRe regex from exec stdout
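Step 4's parsing could look like the sketch below. The actual ncclBandwidthRe lives in nccl_all_reduce_bw_constraint.go and is not reproduced here; avgBusBWRe is a plausible stand-in matching the summary line that nccl-tests prints:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// avgBusBWRe is a stand-in for the validator's existing ncclBandwidthRe.
// It matches the nccl-tests summary line, e.g. "# Avg bus bandwidth    : 87.2".
var avgBusBWRe = regexp.MustCompile(`# Avg bus bandwidth\s*:\s*([0-9.]+)`)

// parseAvgBusBW extracts the average bus bandwidth (GB/s) from exec stdout.
func parseAvgBusBW(stdout string) (float64, bool) {
	m := avgBusBWRe.FindStringSubmatch(stdout)
	if m == nil {
		return 0, false
	}
	v, err := strconv.ParseFloat(m[1], 64)
	return v, err == nil
}

func main() {
	// Tail of typical all_reduce_perf output captured from the exec session.
	out := `# Out of bounds values : 0 OK
# Avg bus bandwidth    : 87.2
#`
	if bw, ok := parseAvgBusBW(out); ok {
		fmt.Printf("avg busBW: %.1f GB/s\n", bw)
	}
}
```

The only change from the EKS path is the source of the text: exec stdout instead of launcher-pod logs, so the regex and threshold check are shared.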

Branch on service == GKE to use this flow; EKS continues using TrainJob path.
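The readiness wait in step 2 reduces to polling a predicate over container statuses until every NCCL test pod reports 2/2 Ready (test container + tcpxo-daemon sidecar). A client-go-free sketch — podStatus is a hypothetical stand-in for the fields the validator would read from corev1.Pod.Status:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// podStatus is a hypothetical stand-in for the fields read from
// corev1.Pod.Status via client-go.
type podStatus struct {
	ContainersReady int // ready containers, e.g. test container + tcpxo-daemon
	ContainersTotal int // expected containers per pod (2 on GKE)
}

// allReady reports whether every pod is fully ready (e.g. 2/2).
func allReady(pods []podStatus) bool {
	for _, p := range pods {
		if p.ContainersReady != p.ContainersTotal {
			return false
		}
	}
	return true
}

// waitReady polls fetch until allReady or the timeout elapses.
func waitReady(fetch func() []podStatus, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if allReady(fetch()) {
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("timed out waiting for NCCL test pods to be Ready")
}

func main() {
	calls := 0
	fetch := func() []podStatus {
		calls++
		if calls < 3 {
			return []podStatus{{1, 2}, {2, 2}} // sidecar still starting on host-1
		}
		return []podStatus{{2, 2}, {2, 2}} // all pods 2/2 Ready
	}
	err := waitReady(fetch, time.Millisecond, time.Second)
	fmt.Println("ready:", err == nil)
}
```

In the real branch, fetch would list the pods via the clientset and the exec trigger (client-go remotecommand) would run only after waitReady returns nil.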

Files

  • validators/performance/nccl_all_reduce_bw_constraint.go — add GKE execution branch
  • validators/performance/testdata/h100/gke/runtime.yaml — already exists (raw Pods + Services)
  • validators/performance/testdata/h100/gke/trainjob.yaml — may be replaced by exec trigger logic

Validation

Manually validated on GKE a3-megagpu-8g (2x H100, COS, K8s 1.35):

  • NCCL AllReduce: 335 GB/s peak busBW, 87.2 GB/s avg
  • Using hostNetwork: true + privileged: true (fallback profile)

Related: #381 (TCPXO hostNetwork requirement), upstream container-engine-accelerators#580
