Summary
- Replace 23 hardcoded NCCL environment variables in the GKE H100 NCCL TrainingRuntime with dynamic sourcing from the host's nccl-env-profile.sh (installed by the nccl-tcpxo-installer DaemonSet)
- Fix MPI/UCX control NIC from eth1 to eth0 to match actual GKE a3-megagpu-8g host network layout
- Change NCCL_DEBUG from INFO to WARN to prevent Kubernetes log buffer overflow that truncated benchmark results
Motivation
The hardcoded NCCL variables broke on clusters where the TCPXO guest
config checker enforced different values than our template. Specific
failures included NCCL_PROTO=Simple,LL128 (expected Simple),
missing NCCL_ALGO=Ring,Tree, missing NCCL_NVLS_ENABLE=0, and
incorrect NIC numbering (eth1-eth9 vs actual eth0-eth8).
Rather than maintaining a fragile copy of version-specific values,
workers now source nccl-env-profile.sh from the host at startup.
This profile is installed by the nccl-tcpxo-installer DaemonSet and
always matches the installed TCPXO version.
How it works
1. Worker entrypoint sources nccl-env-profile.sh from the host mount
2. Exports NCCL_* and CUDA_* vars to ~/.ssh/environment
3. Enables PermitUserEnvironment in sshd so MPI SSH sessions inherit vars
4. Launcher only passes MPI/UCX transport args (--mca, UCX_NET_DEVICES) plus LD_LIBRARY_PATH and NCCL_DEBUG, with no NCCL tuning vars
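
The first two steps above could be sketched roughly as follows. This is an illustration under assumptions, not the entrypoint shipped in this PR: the function name export_nccl_env is hypothetical, and the profile path and output file are taken as arguments because the actual host mount path is not stated here. The missing-file branch reflects the fail-fast behavior described in the reply below.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: source the host NCCL profile and write the resulting
# NCCL_*/CUDA_* variables in NAME=value form (the format sshd reads from
# ~/.ssh/environment). The function name and arguments are illustrative only.

export_nccl_env() {
  local profile="$1"   # path to nccl-env-profile.sh on the host mount (assumed)
  local env_file="$2"  # e.g. "$HOME/.ssh/environment"

  # Fail fast with a clear error instead of running with wrong values.
  if [ ! -r "$profile" ]; then
    echo "ERROR: NCCL profile not found or unreadable: $profile" >&2
    return 1
  fi

  # Source inside a subshell so the caller's environment is left untouched.
  # 'set -a' marks every variable assigned by the profile for export, so
  # plain assignments and 'export' statements are both captured by 'env'.
  (
    set -a
    . "$profile"
    set +a
    env | grep -E '^(NCCL_|CUDA_)' | sort
  ) > "$env_file"

  # Keep the file private to its owner.
  chmod 600 "$env_file"
}
```

A call such as `export_nccl_env "$PROFILE" "$HOME/.ssh/environment"` would run before sshd starts; the file only takes effect for SSH sessions when sshd is configured with PermitUserEnvironment yes. Note that the sketch also copies any NCCL_* or CUDA_* variables already present in the calling environment.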
Test plan
Run NCCL All-Reduce validation on GKE H100 cluster (a3-megagpu-8g)
For the nccl-env-profile.sh dependency:
This is a deliberate design choice, not a gap. The whole point of this PR is to stop hardcoding NCCL vars and source them from the host instead. The profile is installed by the nccl-tcpxo-installer DaemonSet — it's a platform prerequisite on every GKE GPU node, same as having GPUs or the TCPXO plugin itself. If the file is missing, the shell entrypoint fails fast with a clear error rather than silently running with wrong values.
For NIC switch to eth0:
We unified on eth0 for consistency rather than keeping a mix. This is more of a cleanup: both interfaces work, because MPI just needs at least one routable interface that isn't a GPU NIC.