
fix(validator): source NCCL env from host profile instead of hardcoding#422

Merged
yuanchen8911 merged 2 commits into NVIDIA:main from xdu31:fix/gke-nccl on Mar 18, 2026

Conversation

@xdu31 (Contributor) commented Mar 18, 2026

Summary

  • Replace 23 hardcoded NCCL environment variables in the GKE H100 NCCL
    TrainingRuntime with dynamic sourcing from the host's nccl-env-profile.sh
    (installed by nccl-tcpxo-installer DaemonSet)
  • Fix MPI/UCX control NIC from eth1 to eth0 to match actual GKE
    a3-megagpu-8g host network layout
  • Change NCCL_DEBUG from INFO to WARN to prevent Kubernetes log
    buffer overflow that truncated benchmark results

Motivation

The hardcoded NCCL variables broke on clusters where the TCPXO guest
config checker enforced different values than our template. Specific
failures included NCCL_PROTO=Simple,LL128 (expected Simple),
missing NCCL_ALGO=Ring,Tree, missing NCCL_NVLS_ENABLE=0, and
incorrect NIC numbering (eth1-eth9 vs actual eth0-eth8).

Rather than maintaining a fragile copy of version-specific values,
workers now source nccl-env-profile.sh from the host at startup.
This profile is installed by the nccl-tcpxo-installer DaemonSet and
always matches the installed TCPXO version.

How it works

  1. Worker entrypoint sources nccl-env-profile.sh from the host mount
  2. Exports NCCL_* and CUDA_* vars to ~/.ssh/environment
  3. Enables PermitUserEnvironment in sshd so MPI SSH sessions inherit vars
  4. Launcher only passes MPI/UCX transport args (--mca, UCX_NET_DEVICES)
    plus LD_LIBRARY_PATH and NCCL_DEBUG — no NCCL tuning vars
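
The steps above can be sketched as a POSIX-shell helper. This is a minimal sketch based on the PR description: the function name `setup_nccl_env` and the exact paths are assumptions, not the runtime's actual entrypoint.

```shell
#!/bin/sh
# Sketch of the worker-side flow (assumed names/paths, not the real entrypoint):
# source the host-installed NCCL profile, then republish NCCL_*/CUDA_* vars
# where sshd can inject them into MPI SSH sessions.
setup_nccl_env() {
  profile="$1"
  # Fail fast if the host-provided profile is missing.
  [ -f "$profile" ] || { echo "missing NCCL profile: $profile" >&2; return 1; }
  # Source the profile so its exported NCCL tuning vars land in this shell.
  NCCL_LIB_DIR=/usr/local/nvidia/lib64 . "$profile"
  # Write NCCL_*/CUDA_* vars to ~/.ssh/environment; with
  # "PermitUserEnvironment yes" in sshd_config, SSH-launched MPI ranks
  # inherit exactly what the host profile defined.
  mkdir -p "$HOME/.ssh"
  env | grep -E '^(NCCL_|CUDA_)' > "$HOME/.ssh/environment"
}
```

On a real node this would run as, e.g., `setup_nccl_env /usr/local/nvidia/lib64/nccl-env-profile.sh` before starting sshd.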

Test plan

  • Run NCCL All-Reduce validation on GKE H100 cluster (a3-megagpu-8g)
  • Verify bandwidth ~335 GB/s at 8 GB message size (2 nodes)
  • Verify no guest config checker mismatches in launcher logs
  • Verify benchmark table appears in logs (not truncated)
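
For reference, the bandwidth check in the test plan maps onto nccl-tests' all_reduce_perf. The command below is a hypothetical invocation consistent with the PR description (eth0 control NIC, UCX_NET_DEVICES, NCCL_DEBUG=WARN); the runtime's actual launcher arguments are not shown in this PR.

```shell
# Hypothetical 2-node x 8-GPU launcher command (not the actual runtime args):
# only MPI/UCX transport settings plus LD_LIBRARY_PATH and NCCL_DEBUG are
# passed; NCCL tuning vars come from the sourced host profile instead.
mpirun -np 16 -N 8 \
  --mca btl tcp,self --mca btl_tcp_if_include eth0 \
  -x UCX_NET_DEVICES=eth0 \
  -x LD_LIBRARY_PATH -x NCCL_DEBUG=WARN \
  all_reduce_perf -b 8G -e 8G -g 1
```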

@xdu31 xdu31 requested a review from a team as a code owner March 18, 2026 05:11
@mchmarny mchmarny added this to the M2 - KubeCon EU milestone Mar 18, 2026
@yuanchen8911 (Contributor) left a comment

No blocking findings

Testing gaps:

  1. Runtime depends on host-provided nccl-env-profile.sh at execution time:

    NCCL_LIB_DIR=/usr/local/nvidia/lib64 . /usr/local/nvidia/lib64/nccl-env-profile.sh &&

  2. NIC switch to eth0 is environment/runtime-validated (not CI-asserted in this PR):


@xdu31 (Contributor, Author) commented Mar 18, 2026

> No blocking findings
>
> Testing gaps:
>
>   1. Runtime depends on host-provided nccl-env-profile.sh at execution time:
>
>     NCCL_LIB_DIR=/usr/local/nvidia/lib64 . /usr/local/nvidia/lib64/nccl-env-profile.sh &&
>
>   2. NIC switch to eth0 is environment/runtime-validated (not CI-asserted in this PR):

For the nccl-env-profile.sh dependency:
This is a deliberate design choice, not a gap. The whole point of this PR is to stop hardcoding NCCL vars and to source them from the host instead. The profile is installed by the nccl-tcpxo-installer DaemonSet; it is a platform prerequisite on every GKE GPU node, just like the GPUs or the TCPXO plugin itself. If the file is missing, the shell entrypoint fails fast with a clear error rather than silently running with wrong values.
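
The fail-fast behavior described here can be sketched as a small guard (the function name and message text are illustrative assumptions):

```shell
# Abort with a clear error instead of silently benchmarking with wrong
# NCCL values when the host profile is absent.
require_host_profile() {
  if [ ! -f "$1" ]; then
    echo "FATAL: expected host NCCL profile at $1 (is nccl-tcpxo-installer running?)" >&2
    return 1
  fi
}
```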

For NIC switch to eth0:
We unified on eth0 for consistency rather than keeping a mix; this is more of a cleanup than a behavior change. Both interfaces work, since MPI just needs at least one routable interface that isn't a GPU NIC.

@yuanchen8911 yuanchen8911 merged commit e15a3c6 into NVIDIA:main Mar 18, 2026
15 checks passed
xdu31 added a commit to xdu31/aicr that referenced this pull request Mar 24, 2026