
fix(validator): templatize EKS NCCL runtime for dynamic EFA and instance type discovery #447

Merged
xdu31 merged 4 commits into NVIDIA:main from xdu31:feat/eks-nccl-unify
Mar 20, 2026

Conversation

Contributor

@xdu31 xdu31 commented Mar 20, 2026

Summary

  • Templatize EKS runtime: discover instance type and EFA adapter count
    from GPU nodes instead of hardcoding p5.48xlarge and efa: "32"
  • Handle EFA-absent clusters gracefully: NCCL falls back to TCP with
    reduced max message size (4G vs 16G) to avoid hangs
  • Split CSP-specific helpers into nccl_gke_utils.go and nccl_eks_utils.go
  • Change EKS runtime NCCL_DEBUG from INFO to WARN to reduce log noise

Motivation

The EKS runtime hardcoded p5.48xlarge and vpc.amazonaws.com/efa: "32",
preventing it from working on clusters with different instance types or
without the EFA device plugin installed. Pods would fail with
Insufficient vpc.amazonaws.com/efa on clusters lacking the EFA device plugin.

How it works

  1. discoverEKSNodeConfig() reads instance type from node.kubernetes.io/instance-type label and EFA count from vpc.amazonaws.com/efa allocatable resource
  2. buildEFAResourceLine() conditionally injects EFA resource requests/limits — returns empty string when EFA count is 0
  3. When EFA is absent, NCCL falls back to TCP and max message size is capped at 4G (maxMessageSizeTCP) to prevent hangs on large all_reduce over TCP
  4. GKE and EKS helpers are split into dedicated files for maintainability

Test plan

  • Run NCCL All-Reduce validation on EKS H100 cluster (p5.48xlarge, no EFA device plugin)
  • Verify dynamic instance type discovery from node label
  • Verify graceful TCP fallback when EFA device plugin is absent (~3 GB/s)
  • Verify 4G max message size cap prevents test hangs over TCP
  • Verify NCCL_DEBUG=WARN reduces log noise
  • Unit tests pass: go test -race ./validators/performance/...

@xdu31 xdu31 requested a review from a team as a code owner March 20, 2026 05:38
yuanchen8911 previously approved these changes Mar 20, 2026
Contributor

@yuanchen8911 yuanchen8911 left a comment

LGTM — good improvement. Dynamic discovery eliminates the p5.48xlarge/EFA hardcoding and the TCP fallback is well-handled.

A few observations (non-blocking):

Medium:

  1. Single-node assumption for EKS config discovery. discoverEKSNodeConfig(config.Nodes[0]) reads instance type and EFA count from the first GPU node only. In heterogeneous clusters (e.g., mixed p5.48xlarge and p4d.24xlarge), this could select the wrong instance type for the nodeSelector, causing pods to schedule on nodes with a different EFA count than expected. Consider validating that all GPU nodes have the same instance type, or at minimum logging a warning if they differ.

  2. Empty EFA resource lines produce blank lines in YAML. When buildEFAResourceLine returns "", the template substitution for ${EFA_RESOURCE_LIMITS} and ${EFA_RESOURCE_REQUESTS} will leave empty lines in the rendered YAML. While most YAML parsers tolerate this, it could cause yamllint warnings or confuse users inspecting the rendered output. Consider having the template handle the conditional (e.g., an {{if}} block) or trimming blank lines after substitution.

Low:

  1. Nodes field added to gpuConfiguration but only used by EKS. GKE doesn't use config.Nodes — it calls discoverGKEGPUNICNetworks via the dynamic client instead. The field is fine for now, but if more platforms need node-level discovery, consider a dedicated method on gpuConfiguration rather than carrying the full node list.

  2. Test coverage gap. No test for the TCP fallback path end-to-end (EFA count 0 → maxMessageSizeTCP in templateData). The unit tests cover discoverEKSNodeConfig and buildEFAResourceLine individually, but the integration in applyNCCLResources is only validated manually.

@yuanchen8911
Contributor

High finding: The no-EFA fallback is currently incomplete.

In validators/performance/nccl_all_reduce_bw_constraint.go, efaCount == 0 is treated as a TCP fallback path, but the EKS runtime template still hardcodes FI_PROVIDER=efa (and related EFA-specific env behavior) in validators/performance/testdata/h100/eks/runtime.yaml.

With no EFA devices/plugin, forcing FI_PROVIDER=efa can prevent libfabric/NCCL from using TCP, so this path may fail instead of degrading gracefully.

Suggested fix:

  1. Template FI_PROVIDER (and EFA-only env vars) based on discovered efaCount.
  2. For efaCount == 0, render TCP-compatible settings.
  3. Add tests that verify rendered runtime content for both EFA and non-EFA branches.

@xdu31
Contributor Author

xdu31 commented Mar 20, 2026

> High finding: The no-EFA fallback is currently incomplete.
>
> In validators/performance/nccl_all_reduce_bw_constraint.go, efaCount == 0 is treated as a TCP fallback path, but the EKS runtime template still hardcodes FI_PROVIDER=efa (and related EFA-specific env behavior) in validators/performance/testdata/h100/eks/runtime.yaml. […]

In practice our EKS test did complete successfully with FI_PROVIDER=efa and no EFA devices — NCCL fell back to TCP automatically and got ~3 GB/s. The NCCL/libfabric stack handles the missing provider gracefully.

@xdu31
Contributor Author

xdu31 commented Mar 20, 2026

> LGTM — good improvement. […]
>
> Medium: 1. Single-node assumption for EKS config discovery. […] 2. Empty EFA resource lines produce blank lines in YAML. […]
>
> Low: 1. Nodes field added to gpuConfiguration but only used by EKS. […] 2. Test coverage gap. […]

For 1:

The current implementation uses node.kubernetes.io/instance-type as the nodeSelector, which implicitly guarantees GPU homogeneity on cloud instance types (e.g., p5.48xlarge is always H100 SXM5 + 32 EFA). A heterogeneous-node warning was added to catch misconfigurations early.

For a future PR, we plan to add a GPU product compatibility layer using nvidia.com/gpu.product labels (set by GPU Feature Discovery). This would:

1. Define NCCL-compatible families — group GPU products by NVLink topology:

| Family | Products |
| --- | --- |
| h100-sxm | NVIDIA-H100-80GB-HBM3, NVIDIA-H100-80GB-HBM3e |
| h100-pcie | NVIDIA-H100-PCIe |
| h100-nvl | NVIDIA-H100-NVL |
| a100-sxm | NVIDIA-A100-SXM4-80GB, NVIDIA-A100-SXM4-40GB |
| a100-pcie | NVIDIA-A100-PCIe-80GB, NVIDIA-A100-PCIe-40GB |

2. Validate node compatibility — ensure all GPU nodes belong to the same family before scheduling NCCL workers. Products within the same family (e.g., HBM3 + HBM3e) can participate in the same collective, though performance is limited by the slowest variant.

Note

Non-blocking for the current PR — cloud instance-type selectors already prevent mixing, and the heterogeneous node warning provides early detection for edge cases.

@xdu31 xdu31 force-pushed the feat/eks-nccl-unify branch from 6ea85bc to 2541c38 Compare March 20, 2026 17:55
@xdu31 xdu31 requested a review from yuanchen8911 March 20, 2026 17:55
Contributor

@yuanchen8911 yuanchen8911 left a comment

Non-blocking note: warnIfHeterogeneousNodes is log-only — it won't surface to the user in the validator output/report. If a user has mixed instance types, the NCCL job will still run with the first node's config and may fail with a confusing error. Consider promoting this to a validator warning in the results when the GPU product compatibility layer lands. As a short-term solution with that layer planned, this is fine.

LGTM.

@xdu31 xdu31 merged commit 692bbf0 into NVIDIA:main Mar 20, 2026
16 of 27 checks passed
