
feat(recipes): add GKE COS training overlays for H100#383

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:feat/gke-cos-training-overlays
Mar 12, 2026

Conversation

@yuanchen8911
Contributor

Summary

Add complete GKE Container-Optimized OS (COS) training recipe chain with GPUDirect TCPXO networking, NRI device injection, and Kubeflow Trainer support.

Motivation / Context

Enable AICR recipe generation for GKE clusters running COS with H100 GPUs (a3-megagpu-8g). This is the GKE equivalent of the existing EKS Ubuntu training overlays.

Fixes: N/A
Related: #380, #381, #343

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: recipes/components, recipes/overlays, demos, pkg/evidence

Implementation Notes

Recipe chain: base → gke-cos → gke-cos-training → h100-gke-cos-training → h100-gke-cos-training-kubeflow

New overlays:

  • gke-cos-training — GKE COS + training intent with GPU Operator values
  • h100-gke-cos-training — H100-specific with TCPXO, NRI, skyhook COS tuning
  • h100-gke-cos-training-kubeflow — adds Kubeflow Trainer for TrainJob
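For orientation, a rough sketch of what one link in this chain might declare. The schema here (name, base, criteria, components fields) is an assumption inferred from the discussion, not the actual AICR overlay format; it only illustrates the inheritance idea behind the chain.

```yaml
# Hypothetical overlay sketch; field names are assumptions, not the
# real AICR schema. Illustrates how gke-cos-training could extend
# gke-cos and layer on training-specific GPU Operator values.
name: gke-cos-training
base: gke-cos                 # parent overlay in the chain
criteria:
  accelerator: any            # selection criteria for recipe matching
  intent: training
components:
  gpu-operator:
    values: values-gke-cos-training.yaml
```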

New components:

  • gke-nccl-tcpxo — NCCL TCPXO installer (v1.0.15) + NRI device injector manifests
  • gpu-operator/values-gke-cos-training.yaml — training-optimized GPU Operator values
  • gpu-operator/manifests/gke-resource-quota.yaml — system-critical ResourceQuota for GKE
  • skyhook-customizations/manifests/tuning-gke.yaml — COS kernel sysctl tuning

Validator changes:

  • GKE NCCL performance test skipped with informative warning (not yet automated; requires raw Pods with TCPXO sidecar instead of TrainJob)
  • GKE H100 testdata added for manual execution

Evidence collection:

  • Auto-detect cluster description from node metadata (provider, instance type, accelerator) instead of hardcoded recipe name

Known limitations:

Testing

go test ./pkg/recipe/... -run TestAllComponentTypesValid  # PASS
go test ./pkg/recipe/... -run TestConformanceRecipeInvariants  # PASS

Deployed and validated on GKE a3-megagpu-8g cluster with 2 H100 nodes. NCCL AllReduce ~335 GB/s.

Risk Assessment

  • Low — Additive change, new overlays and components only, no modifications to existing EKS recipes
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: N/A — new recipe chain, no impact on existing recipes

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info


@github-advanced-security bot left a comment


Trivy found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@yuanchen8911 force-pushed the feat/gke-cos-training-overlays branch 2 times, most recently from 3112e5a to 8c099b7 on March 12, 2026 20:09
@yuanchen8911 force-pushed the feat/gke-cos-training-overlays branch 5 times, most recently from 94eabbe to 35cbc17 on March 12, 2026 20:36
@yuanchen8911 force-pushed the feat/gke-cos-training-overlays branch 4 times, most recently from cdef847 to 8c2b23c on March 12, 2026 21:23
@yuanchen8911 force-pushed the feat/gke-cos-training-overlays branch 4 times, most recently from e01f470 to 006d12b on March 12, 2026 22:26
@yuanchen8911 changed the title from "WIP: feat(recipes): add GKE COS training overlays for H100" to "feat(recipes): add GKE COS training overlays for H100" on Mar 12, 2026
@yuanchen8911 force-pushed the feat/gke-cos-training-overlays branch from 006d12b to dc517cd on March 12, 2026 22:53
@yuanchen8911 marked this pull request as ready for review on March 12, 2026 22:53
@yuanchen8911 requested review from a team as code owners on March 12, 2026 22:53
dims previously approved these changes on Mar 12, 2026
@mchmarny
Member

Would it be worth doing a spot check on this branch on EKS? Or have you already done that? I couldn't tell from the PR. How confident are we that the E2E tests in CI cover possible regressions?

@mchmarny
Member

Review: GKE COS Training Overlays for H100

Overall: Well-structured, additive PR. Good recipe chain design. A few issues to address.

CI Status

  • All KWOK tests pass (including new gke-cos, gke-cos-training, h100-gke-cos-training, h100-gke-cos-training-kubeflow)
  • GPU conformance and training tests still running
  • Trivy, ClamAV, CodeQL all pass

Critical

1. Breaking change to no-op.yaml defaults — affects existing EKS deployments

The runtimeRequired default changed from true (hardcoded) to false (templated). This changes behavior for all existing recipes that use no-op.yaml (e.g., EKS inference overlays). With runtimeRequired: false, skyhook won't taint GPU nodes until the package completes, potentially allowing GPU pods to schedule before skyhook is ready.

Similarly, autoTaintNewNodes default changed from true to false.

This is a behavioral regression for existing deployments, not just a GKE-additive change.

Suggestion: Either revert the defaults in no-op.yaml back to true (matching current behavior), or split this into a separate tracked change with explicit justification for why the default should change for all recipes.
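To make the flagged diff concrete, here is a sketch of the default change. The field names and the true-to-false flip come from the review; the surrounding structure of no-op.yaml and the templating syntax are assumptions.

```yaml
# Sketch only; templating syntax is an assumption, not skyhook's actual format.

# Current behavior (hardcoded): skyhook taints GPU nodes up front,
# so GPU pods cannot schedule until the package completes.
runtimeRequired: true
autoTaintNewNodes: true

# Proposed in this PR (templated, defaulting to false): nodes stay
# untainted until the package completes, so GPU pods may schedule
# before skyhook is ready; this is the regression described above.
#runtimeRequired: {{ .runtimeRequired | default false }}
#autoTaintNewNodes: {{ .autoTaintNewNodes | default false }}
```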


Important

2. Duplicated cluster-detection logic in collect-evidence.sh

The provider detection + cluster description logic (write_section_header at ~line 147 and main summary at ~line 1475) is copy-pasted. If a third provider is added, both blocks must be updated. Extract to a shared function.
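A minimal sketch of the suggested extraction. The function name matches the author's later follow-up (detect_cluster_info, called once in main()); the provider-ID parsing shown here is illustrative, not the script's actual logic.

```shell
#!/bin/sh
# Hypothetical shared helper for collect-evidence.sh: detect the
# provider once, store it in a global, and reuse it in both the
# section header and the final summary.
detect_cluster_info() {
  provider_id="$1"  # e.g. kubectl get nodes -o jsonpath='{.items[0].spec.providerID}'
  case "$provider_id" in
    gce://*) CLUSTER_PROVIDER="GKE" ;;
    aws://*) CLUSTER_PROVIDER="EKS" ;;
    *)       CLUSTER_PROVIDER="unknown" ;;
  esac
}

# Called once in main(); header and summary then read $CLUSTER_PROVIDER.
detect_cluster_info "gce://my-project/us-central1-a/gke-node-1"
echo "$CLUSTER_PROVIDER"  # prints: GKE
```

A third provider then needs exactly one new case branch instead of edits in two copy-pasted blocks.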

3. Typo in Go skip message

In nccl_all_reduce_bw_constraint.go, the pending-combination skip message has a typo — "implement" should be "implemented":

"skipped - %s+%s NCCL performance test exists but automated execution is not yet implement..."

4. bitnami/kubectl:latest in trainjob.yaml

The trigger Job uses bitnami/kubectl:latest — an unpinned tag. This is a reproducibility concern and conflicts with the project's image provenance goals. Pin to a specific version.
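For illustration, the kind of pinning being asked for. Only the image reference comes from this thread (the author later pinned to bitnami/kubectl:1.35); the surrounding Job fields and paths are hypothetical.

```yaml
# Illustrative fragment only; the real trainjob.yaml trigger Job may
# differ. The point: pin the tag (or better, a digest) instead of :latest.
apiVersion: batch/v1
kind: Job
metadata:
  name: trainjob-trigger
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: kubectl
          image: bitnami/kubectl:1.35  # pinned; :latest is not reproducible
          # Stronger still: pin by digest, e.g. bitnami/kubectl@sha256:<digest>
          command: ["kubectl", "apply", "-f", "/manifests/trainjob.yaml"]
```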


Minor

5. gke-cos.yaml removed accelerator: any and intent: any
The existing overlay explicitly set these. The new version relies on implicit defaults. Verify the recipe engine handles missing criteria fields the same as any.

6. values-gke-cos.yaml and values-gke-cos-training.yaml overlap
The training values file duplicates most of the base COS values (cdi, hostPaths, driver, gdrcopy, toolkit). Since gke-cos-training inherits from gke-cos in the overlay chain, confirm whether the overlay chain properly merges values or if the training file intentionally needs to be self-contained.

7. Evidence collection silently drops sections
The changes to collect-evidence.sh remove write_section_header + skip messages for gateway and operator when not found. Previously these produced a visible "SKIP" in the evidence markdown. Now they're silently omitted. The new summary table compensates, but the individual evidence files lose traceability. Seems intentional — worth confirming.

8. demos/workloads/inference/vllm-agg.yaml label changes are unrelated
The dedicated: cpu-workload → nodeGroup: cpu-worker label changes appear unrelated to GKE COS training. Consider splitting them into a separate commit for cleaner history.


Strengths

  • Clean recipe chain design: base → gke-cos → gke-cos-training → h100-gke-cos-training → h100-gke-cos-training-kubeflow
  • Well-documented Trivy ignores with clear justifications per CVE
  • Good use of pendingNCCLCombinations pattern — honest about what's automated vs. manual
  • Thorough CUJ documentation in demos/cuj1-gke.md
  • Evidence summary table is a nice improvement

@yuanchen8911
Contributor Author

Would it be worth doing a spot check on this branch on EKS? Or have you already done that? I couldn't tell from the PR. How confident are we that the E2E tests in CI cover possible regressions?

Validated on cuj1 and cuj2 using this branch. No regressions.

Member

@mchmarny left a comment


The key blockers:

  • Breaking change to no-op.yaml defaults — affects existing EKS deployments
  • Duplicated cluster-detection logic in collect-evidence.sh
  • fmt.Sprintf used for error-like return in Go code
  • Demo YAML duplication (gke-nccl-test-tcpxo.yaml)
  • bitnami/kubectl:latest in trainjob.yaml

Add complete GKE Container-Optimized OS (COS) training recipe chain
with GPUDirect TCPXO networking, NRI device injection, and Kubeflow
Trainer support.

Recipe chain: base → gke-cos → gke-cos-training → h100-gke-cos-training
  → h100-gke-cos-training-kubeflow

New overlays:
- gke-cos-training: GKE COS + training intent with GPU Operator values
- h100-gke-cos-training: H100-specific with TCPXO, NRI, skyhook tuning
- h100-gke-cos-training-kubeflow: adds Kubeflow Trainer for TrainJob

New components:
- gke-nccl-tcpxo: NCCL TCPXO installer + NRI device injector manifests
- gpu-operator/values-gke-cos-training.yaml: training GPU Operator values
- gpu-operator/manifests/gke-resource-quota.yaml: system-critical quota
- skyhook-customizations/manifests/tuning-gke.yaml: COS kernel tuning

Validator changes:
- Skip GKE NCCL performance test with informative warning (not yet
  automated; requires raw Pods with TCPXO sidecar)
- GKE H100 testdata added for manual execution

Evidence collection:
- Auto-detect cluster description from node metadata instead of
  hardcoded recipe name

Also includes:
- Demo workloads for GKE NCCL TCPXO benchmark and CUJ1 guide
- Fix vllm-agg tolerations/nodeSelectors to match AICR convention
- Skyhook no-op runtimeRequired/autoTaintNewNodes default to false

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@yuanchen8911 force-pushed the feat/gke-cos-training-overlays branch from 5033975 to 4fd364d on March 12, 2026 23:29
@yuanchen8911
Contributor Author

@mchmarny Thanks for the review! Here's how I addressed each item:

  • Breaking change to no-op.yaml defaults: dropped no-op.yaml from the PR entirely; GKE uses tuning-gke.yaml (from PRs #343/#380), so no-op doesn't need changes
  • Duplicated cluster-detection logic: extracted a detect_cluster_info() function, called once in main(); the globals are reused in both the header and the summary
  • fmt.Sprintf for error-like return: no change needed; it's a status message string (the first return value), not an error
  • Demo YAML duplication: added a comment pointing to the canonical source (validators/performance/testdata/h100/gke/runtime.yaml)
  • bitnami/kubectl:latest: pinned to bitnami/kubectl:1.35

@yuanchen8911 merged commit d992630 into NVIDIA:main on Mar 12, 2026
49 of 50 checks passed
xdu31 pushed a commit to xdu31/aicr that referenced this pull request Mar 24, 2026
Signed-off-by: Yuan Chen <yuanchen97@gmail.com>