
feat(recipes): add GKE COS training overlays for H100#383

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:feat/gke-cos-training-overlays
Mar 12, 2026

Conversation

@yuanchen8911
Contributor

Summary

Add complete GKE Container-Optimized OS (COS) training recipe chain with GPUDirect TCPXO networking, NRI device injection, and Kubeflow Trainer support.

Motivation / Context

Enable AICR recipe generation for GKE clusters running COS with H100 GPUs (a3-megagpu-8g). This is the GKE equivalent of the existing EKS Ubuntu training overlays.

Fixes: N/A
Related: #380, #381, #343

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: recipes/components, recipes/overlays, demos, pkg/evidence

Implementation Notes

Recipe chain: base → gke-cos → gke-cos-training → h100-gke-cos-training → h100-gke-cos-training-kubeflow

New overlays:

  • gke-cos-training — GKE COS + training intent with GPU Operator values
  • h100-gke-cos-training — H100-specific with TCPXO, NRI, skyhook COS tuning
  • h100-gke-cos-training-kubeflow — adds Kubeflow Trainer for TrainJob
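For orientation, a rough sketch of what one link in this chain might declare. The schema here (name, base, criteria, components fields) is an assumption inferred from the discussion, not the actual AICR overlay format; it only illustrates the inheritance idea behind the chain.

```yaml
# Hypothetical overlay sketch; field names are assumptions, not the
# real AICR schema. Illustrates how gke-cos-training could extend
# gke-cos and layer on training-specific GPU Operator values.
name: gke-cos-training
base: gke-cos                 # parent overlay in the chain
criteria:
  accelerator: any            # selection criteria for recipe matching
  intent: training
components:
  gpu-operator:
    values: values-gke-cos-training.yaml
```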

New components:

  • gke-nccl-tcpxo — NCCL TCPXO installer (v1.0.15) + NRI device injector manifests
  • gpu-operator/values-gke-cos-training.yaml — training-optimized GPU Operator values
  • gpu-operator/manifests/gke-resource-quota.yaml — system-critical ResourceQuota for GKE
  • skyhook-customizations/manifests/tuning-gke.yaml — COS kernel sysctl tuning

Validator changes:

  • GKE NCCL performance test skipped with informative warning (not yet automated; requires raw Pods with TCPXO sidecar instead of TrainJob)
  • GKE H100 testdata added for manual execution

Evidence collection:

  • Auto-detect cluster description from node metadata (provider, instance type, accelerator) instead of hardcoded recipe name

Known limitations:

Testing

go test ./pkg/recipe/... -run TestAllComponentTypesValid  # PASS
go test ./pkg/recipe/... -run TestConformanceRecipeInvariants  # PASS

Deployed and validated on GKE a3-megagpu-8g cluster with 2 H100 nodes. NCCL AllReduce ~335 GB/s.

Risk Assessment

  • Low — Additive change, new overlays and components only, no modifications to existing EKS recipes
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: N/A — new recipe chain, no impact on existing recipes

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info


@github-advanced-security bot left a comment


Trivy found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@yuanchen8911 force-pushed the feat/gke-cos-training-overlays branch 2 times, most recently from 3112e5a to 8c099b7 on March 12, 2026 20:09
@yuanchen8911 force-pushed the feat/gke-cos-training-overlays branch 5 times, most recently from 94eabbe to 35cbc17 on March 12, 2026 20:36
@yuanchen8911 force-pushed the feat/gke-cos-training-overlays branch 4 times, most recently from cdef847 to 8c2b23c on March 12, 2026 21:23
@yuanchen8911 force-pushed the feat/gke-cos-training-overlays branch 4 times, most recently from e01f470 to 006d12b on March 12, 2026 22:26
@yuanchen8911 changed the title from "WIP: feat(recipes): add GKE COS training overlays for H100" to "feat(recipes): add GKE COS training overlays for H100" on Mar 12, 2026
@yuanchen8911 force-pushed the feat/gke-cos-training-overlays branch from 006d12b to dc517cd on March 12, 2026 22:53
@yuanchen8911 marked this pull request as ready for review on March 12, 2026 22:53
@yuanchen8911 requested review from a team as code owners on March 12, 2026 22:53
dims previously approved these changes on Mar 12, 2026
@mchmarny
Member

Would it be worth doing a spot check on this branch on EKS? Or have you already done that? I couldn't tell from the PR. How confident are we that the E2E tests in CI cover possible regressions?

@mchmarny
Member

Review: GKE COS Training Overlays for H100

Overall: Well-structured, additive PR. Good recipe chain design. A few issues to address.

CI Status

  • All KWOK tests pass (including new gke-cos, gke-cos-training, h100-gke-cos-training, h100-gke-cos-training-kubeflow)
  • GPU conformance and training tests still running
  • Trivy, ClamAV, CodeQL all pass

Critical

1. Breaking change to no-op.yaml defaults — affects existing EKS deployments

The runtimeRequired default changed from true (hardcoded) to false (templated). This changes behavior for all existing recipes that use no-op.yaml (e.g., EKS inference overlays). With runtimeRequired: false, skyhook won't taint GPU nodes until the package completes, potentially allowing GPU pods to schedule before skyhook is ready.

Similarly, autoTaintNewNodes default changed from true to false.

This is a behavioral regression for existing deployments, not just a GKE-additive change.

Suggestion: Either revert the defaults in no-op.yaml back to true (matching current behavior), or split this into a separate tracked change with explicit justification for why the default should change for all recipes.
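To make the flagged diff concrete, here is a sketch of the default change. The field names and the true-to-false flip come from the review; the surrounding structure of no-op.yaml and the templating syntax are assumptions.

```yaml
# Sketch only; templating syntax is an assumption, not skyhook's actual format.

# Current behavior (hardcoded): skyhook taints GPU nodes up front,
# so GPU pods cannot schedule until the package completes.
runtimeRequired: true
autoTaintNewNodes: true

# Proposed in this PR (templated, defaulting to false): nodes stay
# untainted until the package completes, so GPU pods may schedule
# before skyhook is ready; this is the regression described above.
#runtimeRequired: {{ .runtimeRequired | default false }}
#autoTaintNewNodes: {{ .autoTaintNewNodes | default false }}
```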


Important

2. Duplicated cluster-detection logic in collect-evidence.sh

The provider detection + cluster description logic (write_section_header at ~line 147 and main summary at ~line 1475) is copy-pasted. If a third provider is added, both blocks must be updated. Extract to a shared function.
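A minimal sketch of the suggested extraction. The function name matches the author's later follow-up (detect_cluster_info, called once in main()); the provider-ID parsing shown here is illustrative, not the script's actual logic.

```shell
#!/bin/sh
# Hypothetical shared helper for collect-evidence.sh: detect the
# provider once, store it in a global, and reuse it in both the
# section header and the final summary.
detect_cluster_info() {
  provider_id="$1"  # e.g. kubectl get nodes -o jsonpath='{.items[0].spec.providerID}'
  case "$provider_id" in
    gce://*) CLUSTER_PROVIDER="GKE" ;;
    aws://*) CLUSTER_PROVIDER="EKS" ;;
    *)       CLUSTER_PROVIDER="unknown" ;;
  esac
}

# Called once in main(); header and summary then read $CLUSTER_PROVIDER.
detect_cluster_info "gce://my-project/us-central1-a/gke-node-1"
echo "$CLUSTER_PROVIDER"  # prints: GKE
```

A third provider then needs exactly one new case branch instead of edits in two copy-pasted blocks.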

3. Typo in Go skip message

In nccl_all_reduce_bw_constraint.go, the pending-combination skip message has a typo — "implement" should be "implemented":

"skipped - %s+%s NCCL performance test exists but automated execution is not yet implement..."

4. bitnami/kubectl:latest in trainjob.yaml

The trigger Job uses bitnami/kubectl:latest — an unpinned tag. This is a reproducibility concern and conflicts with the project's image provenance goals. Pin to a specific version.
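For illustration, the kind of pinning being asked for. Only the image reference comes from this thread (the author later pinned to bitnami/kubectl:1.35); the surrounding Job fields and paths are hypothetical.

```yaml
# Illustrative fragment only; the real trainjob.yaml trigger Job may
# differ. The point: pin the tag (or better, a digest) instead of :latest.
apiVersion: batch/v1
kind: Job
metadata:
  name: trainjob-trigger
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: kubectl
          image: bitnami/kubectl:1.35  # pinned; :latest is not reproducible
          # Stronger still: pin by digest, e.g. bitnami/kubectl@sha256:<digest>
          command: ["kubectl", "apply", "-f", "/manifests/trainjob.yaml"]
```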


Minor

5. gke-cos.yaml removed accelerator: any and intent: any
The existing overlay explicitly set these. The new version relies on implicit defaults. Verify the recipe engine handles missing criteria fields the same as any.

6. values-gke-cos.yaml and values-gke-cos-training.yaml overlap
The training values file duplicates most of the base COS values (cdi, hostPaths, driver, gdrcopy, toolkit). Since gke-cos-training inherits from gke-cos in the overlay chain, confirm whether the overlay chain properly merges values or if the training file intentionally needs to be self-contained.

7. Evidence collection silently drops sections
The changes to collect-evidence.sh remove write_section_header + skip messages for gateway and operator when not found. Previously these produced a visible "SKIP" in the evidence markdown. Now they're silently omitted. The new summary table compensates, but the individual evidence files lose traceability. Seems intentional — worth confirming.

8. demos/workloads/inference/vllm-agg.yaml label changes are unrelated
The dedicated: cpu-workload → nodeGroup: cpu-worker label changes appear unrelated to GKE COS training. Consider splitting them into a separate commit for cleaner history.


Strengths

  • Clean recipe chain design: base → gke-cos → gke-cos-training → h100-gke-cos-training → h100-gke-cos-training-kubeflow
  • Well-documented Trivy ignores with clear justifications per CVE
  • Good use of pendingNCCLCombinations pattern — honest about what's automated vs. manual
  • Thorough CUJ documentation in demos/cuj1-gke.md
  • Evidence summary table is a nice improvement

@yuanchen8911
Contributor Author

Would it be worth doing a spot check on this branch on EKS? Or have you already done that? I couldn't tell from the PR. How confident are we that the E2E tests in CI cover possible regressions?

Validated on cuj1 and cuj2 using this branch. No regressions.

Member

@mchmarny left a comment


The key blockers:

  • Breaking change to no-op.yaml defaults — affects existing EKS deployments
  • Duplicated cluster-detection logic in collect-evidence.sh
  • fmt.Sprintf used for error-like return in Go code
  • Demo YAML duplication (gke-nccl-test-tcpxo.yaml)
  • bitnami/kubectl:latest in trainjob.yaml

Add complete GKE Container-Optimized OS (COS) training recipe chain
with GPUDirect TCPXO networking, NRI device injection, and Kubeflow
Trainer support.

Recipe chain: base → gke-cos → gke-cos-training → h100-gke-cos-training
  → h100-gke-cos-training-kubeflow

New overlays:
- gke-cos-training: GKE COS + training intent with GPU Operator values
- h100-gke-cos-training: H100-specific with TCPXO, NRI, skyhook tuning
- h100-gke-cos-training-kubeflow: adds Kubeflow Trainer for TrainJob

New components:
- gke-nccl-tcpxo: NCCL TCPXO installer + NRI device injector manifests
- gpu-operator/values-gke-cos-training.yaml: training GPU Operator values
- gpu-operator/manifests/gke-resource-quota.yaml: system-critical quota
- skyhook-customizations/manifests/tuning-gke.yaml: COS kernel tuning

Validator changes:
- Skip GKE NCCL performance test with informative warning (not yet
  automated; requires raw Pods with TCPXO sidecar)
- GKE H100 testdata added for manual execution

Evidence collection:
- Auto-detect cluster description from node metadata instead of
  hardcoded recipe name

Also includes:
- Demo workloads for GKE NCCL TCPXO benchmark and CUJ1 guide
- Fix vllm-agg tolerations/nodeSelectors to match AICR convention
- Skyhook no-op runtimeRequired/autoTaintNewNodes default to false

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@yuanchen8911 force-pushed the feat/gke-cos-training-overlays branch from 5033975 to 4fd364d on March 12, 2026 23:29
@yuanchen8911
Contributor Author

@mchmarny Thanks for the review! Here's how I addressed each item:

  • Breaking change to no-op.yaml defaults: dropped no-op.yaml from the PR entirely; GKE uses tuning-gke.yaml (from PRs #343/#380), so no-op doesn't need changes
  • Duplicated cluster-detection logic: extracted a detect_cluster_info() function, called once in main(); the globals are reused in both the header and the summary
  • fmt.Sprintf for error-like return: no change needed; it's a status message string (the first return value), not an error
  • Demo YAML duplication: added a comment pointing to the canonical source (validators/performance/testdata/h100/gke/runtime.yaml)
  • bitnami/kubectl:latest: pinned to bitnami/kubectl:1.35

@yuanchen8911 merged commit d992630 into NVIDIA:main on Mar 12, 2026
49 of 50 checks passed
xdu31 pushed a commit to xdu31/aicr that referenced this pull request Mar 24, 2026
Signed-off-by: Yuan Chen <yuanchen97@gmail.com>