
fix(validator): templatize EKS NCCL runtime for dynamic EFA and instance type discovery #447

Merged
xdu31 merged 4 commits into NVIDIA:main from xdu31:feat/eks-nccl-unify
Mar 20, 2026

Conversation

Contributor

@xdu31 xdu31 commented Mar 20, 2026

Summary

  • Templatize EKS runtime: discover instance type and EFA adapter count
    from GPU nodes instead of hardcoding p5.48xlarge and efa: "32"
  • Handle EFA-absent clusters gracefully: NCCL falls back to TCP with
    reduced max message size (4G vs 16G) to avoid hangs
  • Split CSP-specific helpers into nccl_gke_utils.go and nccl_eks_utils.go
  • Change EKS runtime NCCL_DEBUG from INFO to WARN to reduce log noise

Motivation

The EKS runtime hardcoded p5.48xlarge and vpc.amazonaws.com/efa: "32",
preventing it from working on clusters with different instance types or
without the EFA device plugin installed. Pods would fail with
Insufficient vpc.amazonaws.com/efa on clusters lacking the EFA device plugin.

How it works

  1. discoverEKSNodeConfig() reads instance type from node.kubernetes.io/instance-type label and EFA count from vpc.amazonaws.com/efa allocatable resource
  2. buildEFAResourceLine() conditionally injects EFA resource requests/limits — returns empty string when EFA count is 0
  3. When EFA is absent, NCCL falls back to TCP and max message size is capped at 4G (maxMessageSizeTCP) to prevent hangs on large all_reduce over TCP
  4. GKE and EKS helpers are split into dedicated files for maintainability

Test plan

  • Run NCCL All-Reduce validation on EKS H100 cluster (p5.48xlarge, no EFA device plugin)
  • Verify dynamic instance type discovery from node label
  • Verify graceful TCP fallback when EFA device plugin is absent (~3 GB/s)
  • Verify 4G max message size cap prevents test hangs over TCP
  • Verify NCCL_DEBUG=WARN reduces log noise
  • Unit tests pass: go test -race ./validators/performance/...

@xdu31 xdu31 requested a review from a team as a code owner March 20, 2026 05:38
yuanchen8911 previously approved these changes Mar 20, 2026
Contributor

@yuanchen8911 yuanchen8911 left a comment

LGTM — good improvement. Dynamic discovery eliminates the p5.48xlarge/EFA hardcoding and the TCP fallback is well-handled.

A few observations (non-blocking):

Medium:

  1. Single-node assumption for EKS config discovery. discoverEKSNodeConfig(config.Nodes[0]) reads instance type and EFA count from the first GPU node only. In heterogeneous clusters (e.g., mixed p5.48xlarge and p4d.24xlarge), this could select the wrong instance type for the nodeSelector, causing pods to schedule on nodes with a different EFA count than expected. Consider validating that all GPU nodes have the same instance type, or at minimum logging a warning if they differ.

  2. Empty EFA resource lines produce blank lines in YAML. When buildEFAResourceLine returns "", the template substitution for ${EFA_RESOURCE_LIMITS} and ${EFA_RESOURCE_REQUESTS} will leave empty lines in the rendered YAML. While most YAML parsers tolerate this, it could cause yamllint warnings or confuse users inspecting the rendered output. Consider having the template handle the conditional (e.g., an {{if}} block) or trimming blank lines after substitution.

Low:

  1. Nodes field added to gpuConfiguration but only used by EKS. GKE doesn't use config.Nodes — it calls discoverGKEGPUNICNetworks via the dynamic client instead. The field is fine for now, but if more platforms need node-level discovery, consider a dedicated method on gpuConfiguration rather than carrying the full node list.

  2. Test coverage gap. No test for the TCP fallback path end-to-end (EFA count 0 → maxMessageSizeTCP in templateData). The unit tests cover discoverEKSNodeConfig and buildEFAResourceLine individually, but the integration in applyNCCLResources is only validated manually.

@yuanchen8911
Contributor

High finding: The no-EFA fallback is currently incomplete.

In validators/performance/nccl_all_reduce_bw_constraint.go, efaCount == 0 is treated as a TCP fallback path, but the EKS runtime template still hardcodes FI_PROVIDER=efa (and related EFA-specific env behavior) in validators/performance/testdata/h100/eks/runtime.yaml.

With no EFA devices/plugin, forcing FI_PROVIDER=efa can prevent libfabric/NCCL from using TCP, so this path may fail instead of degrading gracefully.

Suggested fix:

  1. Template FI_PROVIDER (and EFA-only env vars) based on discovered efaCount.
  2. For efaCount == 0, render TCP-compatible settings.
  3. Add tests that verify rendered runtime content for both EFA and non-EFA branches.

@xdu31
Contributor Author

xdu31 commented Mar 20, 2026

> High finding: The no-EFA fallback is currently incomplete.
>
> In validators/performance/nccl_all_reduce_bw_constraint.go, efaCount == 0 is treated as a TCP fallback path, but the EKS runtime template still hardcodes FI_PROVIDER=efa (and related EFA-specific env behavior) in validators/performance/testdata/h100/eks/runtime.yaml. […]

In practice our EKS test did complete successfully with FI_PROVIDER=efa and no EFA devices — NCCL fell back to TCP automatically and got ~3 GB/s. The NCCL/libfabric stack handles the missing provider gracefully.

@xdu31
Contributor Author

xdu31 commented Mar 20, 2026

> LGTM — good improvement. […]
>
> Medium: 1. Single-node assumption for EKS config discovery. […] 2. Empty EFA resource lines produce blank lines in YAML. […]
>
> Low: 1. Nodes field added to gpuConfiguration but only used by EKS. […] 2. Test coverage gap. […]

For 1:

The current implementation uses node.kubernetes.io/instance-type as the nodeSelector, which implicitly guarantees GPU homogeneity on cloud instance types (e.g., p5.48xlarge is always H100 SXM5 + 32 EFA). A heterogeneous-node warning was added to catch misconfigurations early.

For a future PR, we plan to add a GPU product compatibility layer using nvidia.com/gpu.product labels (set by GPU Feature Discovery). This would:

1. Define NCCL-compatible families — group GPU products by NVLink topology:

| Family | Products |
| --- | --- |
| h100-sxm | NVIDIA-H100-80GB-HBM3, NVIDIA-H100-80GB-HBM3e |
| h100-pcie | NVIDIA-H100-PCIe |
| h100-nvl | NVIDIA-H100-NVL |
| a100-sxm | NVIDIA-A100-SXM4-80GB, NVIDIA-A100-SXM4-40GB |
| a100-pcie | NVIDIA-A100-PCIe-80GB, NVIDIA-A100-PCIe-40GB |

2. Validate node compatibility — ensure all GPU nodes belong to the same family before scheduling NCCL workers. Products within the same family (e.g., HBM3 + HBM3e) can participate in the same collective, though performance is limited by the slowest variant.

Note

Non-blocking for the current PR — cloud instance-type selectors already prevent mixing, and the heterogeneous node warning provides early detection for edge cases.

@xdu31 xdu31 force-pushed the feat/eks-nccl-unify branch from 6ea85bc to 2541c38 Compare March 20, 2026 17:55
@xdu31 xdu31 requested a review from yuanchen8911 March 20, 2026 17:55
Contributor

@yuanchen8911 yuanchen8911 left a comment

Non-blocking note: warnIfHeterogeneousNodes is log-only — it won't surface to the user in the validator output/report. If a user has mixed instance types, the NCCL job will still run with the first node's config and may fail with a confusing error. Consider promoting this to a validator warning in the results when the GPU product compatibility layer lands. As a short-term solution with that layer planned, this is fine.

LGTM.

@xdu31 xdu31 merged commit 692bbf0 into NVIDIA:main Mar 20, 2026
16 of 27 checks passed
