feat(recipes): add GKE COS training overlays for H100 #383

yuanchen8911 merged 1 commit into NVIDIA:main
Conversation
Trivy found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
Would it be worth doing a spot check of this branch on EKS? Or have you already done that? I couldn't tell from the PR. How confident are we that the E2E tests in CI cover possible regressions?
Review: GKE COS Training Overlays for H100

Overall: a well-structured, additive PR with a good recipe chain design. A few issues to address.

Critical

1. Breaking change to `no-op.yaml` defaults: `runtimeRequired` and `autoTaintNewNodes` now default to `false`. This is a behavioral regression for existing deployments, not just a GKE-additive change. Suggestion: revert the defaults in `no-op.yaml`.

Important

2. Duplicated cluster-detection logic in `collect-evidence.sh` (provider detection + cluster description).
3. Typo in the Go skip message.
4. The trigger Job uses `bitnami/kubectl:latest` in `trainjob.yaml` (unpinned image).

Minor

5. Demo YAML duplication (`gke-nccl-test-tcpxo.yaml`).
6. `fmt.Sprintf` used for an error-like return in Go code.
7. Evidence collection silently drops sections.

Strengths: additive design, clear recipe chain.
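The duplication flagged in item 2 could be addressed with one shared helper. A minimal sketch, assuming `collect-evidence.sh` can derive the provider from a node's `spec.providerID` (the helper name and mapping here are hypothetical, not the PR's actual code):

```shell
#!/usr/bin/env sh
# Hypothetical shared helper for collect-evidence.sh: derive the cloud
# provider once from a node's spec.providerID (e.g. "gce://proj/zone/node"),
# so the detection logic lives in one place instead of being duplicated.
detect_provider() {
  case "$1" in
    gce://*)   echo gke ;;
    aws://*)   echo eks ;;
    azure://*) echo aks ;;
    *)         echo unknown ;;
  esac
}

detect_provider "gce://my-project/us-central1-a/gke-node-1"   # gke
detect_provider "aws:///us-west-2a/i-0abc123"                 # eks
```

In the script, the providerID would come from something like `kubectl get node "$n" -o jsonpath='{.spec.providerID}'`, and both the provider-detection and cluster-description code paths would call this one function.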
Validated on cuj1 and cuj2 using this branch. No regression.
mchmarny left a comment
The key blockers:
- Breaking change to no-op.yaml defaults — affects existing EKS deployments
- Duplicated cluster-detection logic in collect-evidence.sh
- fmt.Sprintf used for error-like return in Go code
- Demo YAML duplication (gke-nccl-test-tcpxo.yaml)
- bitnami/kubectl:latest in trainjob.yaml
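For the unpinned-image blocker, the fix is to pin the trigger Job's image to an explicit tag (ideally a digest). A hedged sketch of what that fragment of `trainjob.yaml` could look like; the Job name and tag below are illustrative, not the ones in the PR:

```yaml
# Fragment of a trigger Job — pin the kubectl image instead of :latest.
# The tag is illustrative; pick a version matching the cluster's
# Kubernetes minor version, and prefer pinning by digest.
apiVersion: batch/v1
kind: Job
metadata:
  name: trainjob-trigger
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: kubectl
          image: bitnami/kubectl:1.30.4   # pinned tag, not :latest
```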
Add complete GKE Container-Optimized OS (COS) training recipe chain with GPUDirect TCPXO networking, NRI device injection, and Kubeflow Trainer support.

Recipe chain: base → gke-cos → gke-cos-training → h100-gke-cos-training → h100-gke-cos-training-kubeflow

New overlays:
- gke-cos-training: GKE COS + training intent with GPU Operator values
- h100-gke-cos-training: H100-specific with TCPXO, NRI, skyhook tuning
- h100-gke-cos-training-kubeflow: adds Kubeflow Trainer for TrainJob

New components:
- gke-nccl-tcpxo: NCCL TCPXO installer + NRI device injector manifests
- gpu-operator/values-gke-cos-training.yaml: training GPU Operator values
- gpu-operator/manifests/gke-resource-quota.yaml: system-critical quota
- skyhook-customizations/manifests/tuning-gke.yaml: COS kernel tuning

Validator changes:
- Skip GKE NCCL performance test with informative warning (not yet automated; requires raw Pods with TCPXO sidecar)
- GKE H100 testdata added for manual execution

Evidence collection:
- Auto-detect cluster description from node metadata instead of hardcoded recipe name

Also includes:
- Demo workloads for GKE NCCL TCPXO benchmark and CUJ1 guide
- Fix vllm-agg tolerations/nodeSelectors to match AICR convention
- Skyhook no-op runtimeRequired/autoTaintNewNodes default to false

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
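The evidence-collection change above could look roughly like this. A minimal sketch, assuming the description is assembled from node metadata values; the function name and output format are hypothetical, not the PR's actual implementation:

```shell
# Hypothetical sketch: build the cluster description from node metadata
# instead of a hardcoded recipe name. In collect-evidence.sh the inputs
# would come from kubectl, e.g.
#   kubectl get node "$n" -o jsonpath='{.metadata.labels.node\.kubernetes\.io/instance-type}'
# Here they are passed as arguments so the helper stays testable offline.
describe_cluster() {
  instance_type="$1"
  zone="$2"
  node_count="$3"
  echo "${node_count}x ${instance_type} (${zone})"
}

describe_cluster a3-megagpu-8g us-central1-a 2   # 2x a3-megagpu-8g (us-central1-a)
```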
@mchmarny Thanks for the review! Here's how I addressed each item:
Summary
Add complete GKE Container-Optimized OS (COS) training recipe chain with GPUDirect TCPXO networking, NRI device injection, and Kubeflow Trainer support.
Motivation / Context
Enable AICR recipe generation for GKE clusters running COS with H100 GPUs (a3-megagpu-8g). This is the GKE equivalent of the existing EKS Ubuntu training overlays.
Fixes: N/A
Related: #380, #381, #343
Type of Change
Component(s) Affected
- `cmd/aicr`, `pkg/cli`
- `cmd/aicrd`, `pkg/api`, `pkg/server`
- `pkg/recipe`
- `pkg/bundler`, `pkg/component/*`
- `pkg/collector`, `pkg/snapshotter`
- `pkg/validator`
- `pkg/errors`, `pkg/k8s`
- `docs/`, `examples/`

Implementation Notes
Recipe chain:

`base → gke-cos → gke-cos-training → h100-gke-cos-training → h100-gke-cos-training-kubeflow`

New overlays:

- `gke-cos-training` — GKE COS + training intent with GPU Operator values
- `h100-gke-cos-training` — H100-specific with TCPXO, NRI, skyhook COS tuning
- `h100-gke-cos-training-kubeflow` — adds Kubeflow Trainer for TrainJob

New components:

- `gke-nccl-tcpxo` — NCCL TCPXO installer (v1.0.15) + NRI device injector manifests
- `gpu-operator/values-gke-cos-training.yaml` — training-optimized GPU Operator values
- `gpu-operator/manifests/gke-resource-quota.yaml` — system-critical ResourceQuota for GKE
- `skyhook-customizations/manifests/tuning-gke.yaml` — COS kernel sysctl tuning

Validator changes:
- Skip the GKE NCCL performance test with an informative warning (not yet automated; requires raw Pods with the TCPXO sidecar)
- GKE H100 testdata added for manual execution

Evidence collection:

- Auto-detect cluster description from node metadata instead of hardcoded recipe name

Known limitations:
- Requires `hostNetwork: true` for full PCI sysfs visibility (upstream: GPUDirect-TCPXO daemon sees incomplete PCI GPU inventory without hostNetwork on GKE A3 Mega, GoogleCloudPlatform/container-engine-accelerators#580)
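To illustrate that limitation, a benchmark Pod on an A3 Mega node would carry roughly the following. This is a hedged sketch; the Pod name, image, and GPU count are hypothetical, not taken from the PR's manifests:

```yaml
# Hypothetical NCCL benchmark Pod fragment for GKE A3 Mega.
# hostNetwork: true works around the incomplete PCI GPU inventory seen
# by the GPUDirect-TCPXO daemon (container-engine-accelerators#580).
apiVersion: v1
kind: Pod
metadata:
  name: nccl-tcpxo-bench
spec:
  hostNetwork: true                       # full PCI sysfs visibility
  dnsPolicy: ClusterFirstWithHostNet      # required with hostNetwork
  containers:
    - name: nccl
      image: nvcr.io/nvidia/pytorch:24.05-py3   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 8
```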
Deployed and validated on a GKE a3-megagpu-8g cluster with 2 H100 nodes; NCCL AllReduce reached ~335 GB/s.
Risk Assessment
Rollout notes: N/A — new recipe chain, no impact on existing recipes
Checklist

- `make test` (with `-race`)
- `make lint`
- Signed commits (`git commit -S`) — GPG signing info