fix(validator): templatize EKS NCCL runtime for dynamic EFA and instance type discovery #447
Conversation
yuanchen8911
left a comment
LGTM — good improvement. Dynamic discovery eliminates the p5.48xlarge/EFA hardcoding and the TCP fallback is well-handled.
A few observations (non-blocking):
Medium:
- **Single-node assumption for EKS config discovery.** `discoverEKSNodeConfig(config.Nodes[0])` reads the instance type and EFA count from the first GPU node only. In heterogeneous clusters (e.g., mixed p5.48xlarge and p4d.24xlarge), this could select the wrong instance type for the nodeSelector, causing pods to schedule on nodes with a different EFA count than expected. Consider validating that all GPU nodes have the same instance type, or at minimum logging a warning if they differ.
- **Empty EFA resource lines produce blank lines in YAML.** When `buildEFAResourceLine` returns `""`, the template substitution for `${EFA_RESOURCE_LIMITS}` and `${EFA_RESOURCE_REQUESTS}` leaves empty lines in the rendered YAML. While most YAML parsers tolerate this, it could cause yamllint warnings or confuse users inspecting the rendered output. Consider having the template handle the conditional (e.g., an `{{if}}` block) or trimming blank lines after substitution.
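The `{{if}}`-based alternative suggested above can be sketched with Go's `text/template`. This is a minimal illustration only — the template fragment, the `nodeConfig` struct, and `renderPodSpec` are assumptions, not the project's actual template or helpers:

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Hypothetical fragment of the NCCL worker pod template. The {{if}} block
// emits the EFA resource line only when EFACount > 0, and the "{{-" trim
// markers remove the surrounding newlines, so the rendered YAML contains
// no blank lines when EFA is absent.
const podSpecTmpl = `resources:
  limits:
    nvidia.com/gpu: {{.GPUCount}}
{{- if gt .EFACount 0}}
    vpc.amazonaws.com/efa: "{{.EFACount}}"
{{- end}}
`

// nodeConfig is an illustrative stand-in for the discovered node settings.
type nodeConfig struct {
	GPUCount int
	EFACount int
}

var podTmpl = template.Must(template.New("pod").Parse(podSpecTmpl))

// renderPodSpec renders the fragment for one node configuration.
func renderPodSpec(cfg nodeConfig) string {
	var b strings.Builder
	if err := podTmpl.Execute(&b, cfg); err != nil {
		panic(err)
	}
	return b.String()
}

func main() {
	// With EFA: the resource line is emitted.
	fmt.Print(renderPodSpec(nodeConfig{GPUCount: 8, EFACount: 32}))
	fmt.Println("---")
	// Without EFA: the line is skipped with no blank line left behind.
	fmt.Print(renderPodSpec(nodeConfig{GPUCount: 8, EFACount: 0}))
}
```

Pushing the conditional into the template keeps the Go side free of string concatenation and makes the no-EFA rendering trivially testable.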
Low:
- **`Nodes` field added to `gpuConfiguration` but only used by EKS.** GKE doesn't use `config.Nodes` — it calls `discoverGKEGPUNICNetworks` via the dynamic client instead. The field is fine for now, but if more platforms need node-level discovery, consider a dedicated method on `gpuConfiguration` rather than carrying the full node list.
- **Test coverage gap.** No end-to-end test for the TCP fallback path (EFA count 0 → `maxMessageSizeTCP` in `templateData`). The unit tests cover `discoverEKSNodeConfig` and `buildEFAResourceLine` individually, but their integration in `applyNCCLResources` is only validated manually.
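The homogeneity check suggested in the Medium finding could look like the sketch below. The `nodeInfo` struct and `distinctInstanceTypes` are hypothetical; a plain struct stands in for `corev1.Node` to keep the example dependency-free:

```go
package main

import "fmt"

// nodeInfo is a hypothetical stand-in for the fields discoverEKSNodeConfig
// would read from a corev1.Node (instance-type label and EFA allocatable).
type nodeInfo struct {
	Name         string
	InstanceType string
	EFACount     int
}

// distinctInstanceTypes returns the distinct instance types among GPU nodes,
// in first-seen order. More than one entry means the first node's config may
// not match the nodes the NCCL pods actually land on.
func distinctInstanceTypes(nodes []nodeInfo) []string {
	seen := map[string]bool{}
	var types []string
	for _, n := range nodes {
		if !seen[n.InstanceType] {
			seen[n.InstanceType] = true
			types = append(types, n.InstanceType)
		}
	}
	return types
}

func main() {
	nodes := []nodeInfo{
		{Name: "node-a", InstanceType: "p5.48xlarge", EFACount: 32},
		{Name: "node-b", InstanceType: "p4d.24xlarge", EFACount: 4},
	}
	if types := distinctInstanceTypes(nodes); len(types) > 1 {
		fmt.Printf("warning: heterogeneous GPU nodes %v; using config from %s\n",
			types, nodes[0].Name)
	}
}
```

Returning the full set of types (rather than a bool) makes the warning message actionable and gives a later validator-level check something concrete to report.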
High finding: the no-EFA fallback is currently incomplete — in `validators/performance/nccl_all_reduce_bw_constraint.go`, with no EFA devices/plugin, forcing `FI_PROVIDER=efa` could fail rather than fall back. Suggested fix:
In practice our EKS test did complete successfully with `FI_PROVIDER=efa` and no EFA devices — NCCL fell back to TCP automatically and achieved ~3 GB/s. The NCCL/libfabric stack handles the missing provider gracefully.
For 1: the current implementation uses the first GPU node's configuration (`config.Nodes[0]`). For a future PR, we plan to add a GPU product compatibility layer:

1. Define NCCL-compatible families — group GPU products by NVLink topology.
2. Validate node compatibility — ensure all GPU nodes belong to the same family before scheduling NCCL workers. Products within the same family (e.g., HBM3 + HBM3e) can participate in the same collective, though performance is limited by the slowest variant.

Note: non-blocking for the current PR — cloud instance-type selectors already prevent mixing, and the heterogeneous node warning provides early detection for edge cases.
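The proposed compatibility layer could be sketched as a simple family lookup. The `ncclFamily` map, its product names, and `sameFamily` are illustrative assumptions about the planned design, not the final implementation:

```go
package main

import "fmt"

// ncclFamily maps a GPU product string to a hypothetical NCCL-compatibility
// family grouped by NVLink topology. The groupings here are illustrative.
var ncclFamily = map[string]string{
	"NVIDIA-H100-80GB-HBM3": "hopper",
	"NVIDIA-H200":           "hopper", // HBM3e, same family as H100
	"NVIDIA-A100-SXM4-80GB": "ampere",
}

// sameFamily reports whether all GPU products belong to one NCCL-compatible
// family, returning that family. Unknown products fail closed.
func sameFamily(products []string) (string, bool) {
	if len(products) == 0 {
		return "", false
	}
	family, ok := ncclFamily[products[0]]
	if !ok {
		return "", false
	}
	for _, p := range products[1:] {
		if ncclFamily[p] != family {
			return "", false
		}
	}
	return family, true
}

func main() {
	fam, ok := sameFamily([]string{"NVIDIA-H100-80GB-HBM3", "NVIDIA-H200"})
	fmt.Println(fam, ok) // hopper true
}
```

Failing closed on unknown products keeps a newly released GPU from silently joining a mismatched collective until the table is updated.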
… message size for EFA-absent clusters (force-pushed from 6ea85bc to 2541c38)
yuanchen8911
left a comment
Non-blocking note: `warnIfHeterogeneousNodes` is log-only — it won't surface in the validator output/report. If a user has mixed instance types, the NCCL job will still run with the first node's config and may fail with a confusing error. Consider promoting this to a validator warning in the results when the GPU product compatibility layer lands. As a short-term solution with that layer planned, this is fine.
LGTM.
Summary

- Discovers instance type and EFA count from GPU nodes instead of hardcoding `p5.48xlarge` and `efa: "32"`
- Reduced max message size (4G vs 16G) to avoid hangs
- Split platform-specific code into `nccl_gke_utils.go` and `nccl_eks_utils.go`
- Lowered `NCCL_DEBUG` from INFO to WARN to reduce log noise

Motivation

The EKS runtime hardcoded `p5.48xlarge` and `vpc.amazonaws.com/efa: "32"`, preventing it from working on clusters with different instance types or without the EFA device plugin installed. Pods would fail with `Insufficient vpc.amazonaws.com/efa` on clusters lacking the EFA device plugin.

How it works

- `discoverEKSNodeConfig()` reads the instance type from the `node.kubernetes.io/instance-type` label and the EFA count from the `vpc.amazonaws.com/efa` allocatable resource
- `buildEFAResourceLine()` conditionally injects EFA resource requests/limits — returns an empty string when the EFA count is 0
- Falls back to a reduced max message size (`maxMessageSizeTCP`) to prevent hangs on large all_reduce over TCP

Test plan

- `NCCL_DEBUG=WARN` reduces log noise
- `go test -race ./validators/performance/...`
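The discovery step described under "How it works" can be sketched dependency-free as follows. The `node` struct is a stand-in for `corev1.Node` (with allocatable quantities flattened to strings), and `discoverNodeConfig`'s exact shape is an assumption about the real `discoverEKSNodeConfig`:

```go
package main

import (
	"fmt"
	"strconv"
)

// node mimics the parts of corev1.Node the discovery reads: labels and
// allocatable resources (resource.Quantity flattened to strings here).
type node struct {
	Labels      map[string]string
	Allocatable map[string]string
}

// eksNodeConfig holds the discovered scheduling parameters.
type eksNodeConfig struct {
	InstanceType string
	EFACount     int
}

// discoverNodeConfig reads the instance-type label and the EFA allocatable
// count. A missing vpc.amazonaws.com/efa resource yields EFACount 0, which
// selects the TCP fallback (reduced max message size) downstream.
func discoverNodeConfig(n node) eksNodeConfig {
	cfg := eksNodeConfig{
		InstanceType: n.Labels["node.kubernetes.io/instance-type"],
	}
	if v, ok := n.Allocatable["vpc.amazonaws.com/efa"]; ok {
		if count, err := strconv.Atoi(v); err == nil {
			cfg.EFACount = count
		}
	}
	return cfg
}

func main() {
	withEFA := node{
		Labels:      map[string]string{"node.kubernetes.io/instance-type": "p5.48xlarge"},
		Allocatable: map[string]string{"vpc.amazonaws.com/efa": "32"},
	}
	noEFA := node{
		Labels: map[string]string{"node.kubernetes.io/instance-type": "g5.12xlarge"},
	}
	fmt.Println(discoverNodeConfig(withEFA)) // {p5.48xlarge 32}
	fmt.Println(discoverNodeConfig(noEFA))   // {g5.12xlarge 0}
}
```

Treating a missing allocatable entry as zero (rather than an error) is what makes clusters without the EFA device plugin degrade to TCP instead of failing scheduling.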