### Problem

The NCCL performance validator (`nccl-all-reduce-bw`) only supports EKS via a Kubeflow TrainJob. GKE requires a different execution model because GPUDirect TCPXO needs a `tcpxo-daemon` sidecar per pod and `hostNetwork: true`, which doesn't fit the TrainJob abstraction.

GKE+H100 is currently in `pendingNCCLCombinations` and skips with an informative message. Testdata exists at `validators/performance/testdata/h100/gke/` but cannot be used by the current automation.
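For illustration, the GKE execution model roughly implies a pod shape like the following. This is a sketch only: the container names, images, and fields are assumptions, not the actual contents of `testdata/h100/gke/runtime.yaml`.

```yaml
# Illustrative only — names and images are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: host-1
spec:
  hostNetwork: true              # required by GPUDirect TCPXO
  containers:
    - name: nccl-test            # runs the all-reduce benchmark
      image: example/nccl-test:latest      # placeholder image
      securityContext:
        privileged: true         # fallback profile noted under Validation
      resources:
        limits:
          nvidia.com/gpu: 8
    - name: tcpxo-daemon         # per-pod sidecar TCPXO requires
      image: example/tcpxo-daemon:latest   # placeholder image
```

Because each pod carries the sidecar and host networking, the workload is raw Pods + Services rather than anything TrainJob can model.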
### Root Cause

The validator's apply flow is hardcoded to single-GVR resources:
- Apply `runtime.yaml` as a `TrainingRuntime`
- Apply `trainjob.yaml` as a `TrainJob`
- Wait for TrainJob completion
- Extract logs from the launcher pod

GKE's `runtime.yaml` contains multiple resource types (Services + Pods), causing: `the API version in the data (v1) does not match the expected API version (trainer.kubeflow.org/v1alpha1)`
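The root cause can be seen with a minimal sketch of the missing step: splitting a multi-document manifest and detecting each document's `apiVersion`/`kind` before choosing a GVR. `splitDocs` is a hypothetical helper, not code from the repo; a real implementation would decode with apimachinery and resolve GVRs via a RESTMapper, but the shape is:

```go
package main

import (
	"fmt"
	"strings"
)

// splitDocs is a hypothetical helper: it splits a multi-document YAML
// manifest on "---" separators and extracts each document's
// apiVersion/kind pair, so the caller can resolve a GVR per document
// instead of assuming one resource type for the whole file.
func splitDocs(manifest string) [][2]string {
	var out [][2]string
	for _, doc := range strings.Split(manifest, "\n---") {
		doc = strings.TrimSpace(doc)
		if doc == "" {
			continue
		}
		var apiVersion, kind string
		for _, line := range strings.Split(doc, "\n") {
			if v, ok := strings.CutPrefix(line, "apiVersion:"); ok {
				apiVersion = strings.TrimSpace(v)
			} else if v, ok := strings.CutPrefix(line, "kind:"); ok {
				kind = strings.TrimSpace(v)
			}
		}
		out = append(out, [2]string{apiVersion, kind})
	}
	return out
}

func main() {
	manifest := `apiVersion: v1
kind: Service
metadata:
  name: host-1
---
apiVersion: v1
kind: Pod
metadata:
  name: host-1
`
	for _, d := range splitDocs(manifest) {
		fmt.Println(d[0], d[1]) // v1 Service, then v1 Pod
	}
}
```

The current flow instead feeds the whole file to the TrainingRuntime GVR, which is why a plain `v1` Service trips the API-version check.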
### Proposed Fix

Add a GKE execution strategy in `validateNcclAllReduceBw`:
- Multi-resource apply — split the YAML on `---`, detect each resource's GVR from `apiVersion`/`kind`, apply each independently
- Pod readiness wait — wait for the NCCL test pods to be 2/2 Ready (not TrainJob completion)
- Exec-based trigger — `kubectl exec` into host-1 to run `/scripts/allreduce.sh`
- Parse output — reuse the existing `ncclBandwidthRe` regex against the exec stdout
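The parse step could look like the sketch below. The pattern is an assumption about the nccl-tests summary line (`# Avg bus bandwidth : <value>`); the repo's actual `ncclBandwidthRe` may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// avgBusBWRe approximates the nccl-tests summary line; the real
// ncclBandwidthRe in the repo may use a different pattern.
var avgBusBWRe = regexp.MustCompile(`# Avg bus bandwidth\s*:\s*([0-9.]+)`)

// parseAvgBusBW scans exec stdout for the average bus bandwidth (GB/s).
func parseAvgBusBW(stdout string) (float64, bool) {
	m := avgBusBWRe.FindStringSubmatch(stdout)
	if m == nil {
		return 0, false
	}
	v, err := strconv.ParseFloat(m[1], 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	out := "# Out of bounds values : 0 OK\n# Avg bus bandwidth    : 87.2\n"
	bw, ok := parseAvgBusBW(out)
	fmt.Println(bw, ok) // prints "87.2 true"
}
```

The only change from the TrainJob path is the input source: stdout captured from `kubectl exec` instead of launcher pod logs.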
Branch on `service == GKE` to use this flow; EKS continues to use the TrainJob path.
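The branch itself is trivial; a hypothetical sketch (identifiers are illustrative, not the actual code in `nccl_all_reduce_bw_constraint.go`):

```go
package main

import "fmt"

// pickFlow is a hypothetical stand-in for the proposed branch inside
// validateNcclAllReduceBw: GKE gets the new exec-based strategy, every
// other service keeps the existing TrainJob path.
func pickFlow(service string) string {
	if service == "GKE" {
		// multi-resource apply + pod readiness wait + exec trigger
		return "gke-exec"
	}
	// existing TrainingRuntime/TrainJob path (EKS)
	return "trainjob"
}

func main() {
	fmt.Println(pickFlow("GKE"), pickFlow("EKS")) // prints "gke-exec trainjob"
}
```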
### Files

- `validators/performance/nccl_all_reduce_bw_constraint.go` — add GKE execution branch
- `validators/performance/testdata/h100/gke/runtime.yaml` — already exists (raw Pods + Services)
- `validators/performance/testdata/h100/gke/trainjob.yaml` — may be replaced by the exec trigger logic
### Validation

Manually validated on GKE a3-megagpu-8g (2x H100, COS, K8s 1.35):
- NCCL AllReduce: 335 GB/s peak busBW, 87.2 GB/s avg
- Using `hostNetwork: true` + `privileged: true` (fallback profile)
Related: #381 (TCPXO hostNetwork requirement), upstream container-engine-accelerators#580