
fix(validator): source NCCL env from host profile instead of hardcoding#422

Merged
yuanchen8911 merged 2 commits into NVIDIA:main from xdu31:fix/gke-nccl on Mar 18, 2026

Conversation

@xdu31 (Contributor) commented Mar 18, 2026

Summary

  • Replace 23 hardcoded NCCL environment variables in the GKE H100 NCCL
    TrainingRuntime with dynamic sourcing from the host's nccl-env-profile.sh
    (installed by nccl-tcpxo-installer DaemonSet)
  • Fix MPI/UCX control NIC from eth1 to eth0 to match actual GKE
    a3-megagpu-8g host network layout
  • Change NCCL_DEBUG from INFO to WARN to prevent Kubernetes log
    buffer overflow that truncated benchmark results

Motivation

The hardcoded NCCL variables broke on clusters where the TCPXO guest
config checker enforced different values than our template. Specific
failures included NCCL_PROTO=Simple,LL128 (expected Simple),
missing NCCL_ALGO=Ring,Tree, missing NCCL_NVLS_ENABLE=0, and
incorrect NIC numbering (eth1-eth9 vs actual eth0-eth8).

Rather than maintaining a fragile copy of version-specific values,
workers now source nccl-env-profile.sh from the host at startup.
This profile is installed by the nccl-tcpxo-installer DaemonSet and
always matches the installed TCPXO version.

How it works

  1. Worker entrypoint sources nccl-env-profile.sh from the host mount
  2. Exports NCCL_* and CUDA_* vars to ~/.ssh/environment
  3. Enables PermitUserEnvironment in sshd so MPI SSH sessions inherit vars
  4. Launcher only passes MPI/UCX transport args (--mca, UCX_NET_DEVICES)
    plus LD_LIBRARY_PATH and NCCL_DEBUG — no NCCL tuning vars
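
The steps above can be sketched as a POSIX-shell helper. This is a minimal sketch based on the PR description: the function name `setup_nccl_env` and the exact paths are assumptions, not the runtime's actual entrypoint.

```shell
#!/bin/sh
# Sketch of the worker-side flow (assumed names/paths, not the real entrypoint):
# source the host-installed NCCL profile, then republish NCCL_*/CUDA_* vars
# where sshd can inject them into MPI SSH sessions.
setup_nccl_env() {
  profile="$1"
  # Fail fast if the host-provided profile is missing.
  [ -f "$profile" ] || { echo "missing NCCL profile: $profile" >&2; return 1; }
  # Source the profile so its exported NCCL tuning vars land in this shell.
  NCCL_LIB_DIR=/usr/local/nvidia/lib64 . "$profile"
  # Write NCCL_*/CUDA_* vars to ~/.ssh/environment; with
  # "PermitUserEnvironment yes" in sshd_config, SSH-launched MPI ranks
  # inherit exactly what the host profile defined.
  mkdir -p "$HOME/.ssh"
  env | grep -E '^(NCCL_|CUDA_)' > "$HOME/.ssh/environment"
}
```

On a real node this would run as, e.g., `setup_nccl_env /usr/local/nvidia/lib64/nccl-env-profile.sh` before starting sshd.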

Test plan

  • Run NCCL All-Reduce validation on GKE H100 cluster (a3-megagpu-8g)
  • Verify bandwidth ~335 GB/s at 8 GB message size (2 nodes)
  • Verify no guest config checker mismatches in launcher logs
  • Verify benchmark table appears in logs (not truncated)
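
For reference, the bandwidth check in the test plan maps onto nccl-tests' all_reduce_perf. The command below is a hypothetical invocation consistent with the PR description (eth0 control NIC, UCX_NET_DEVICES, NCCL_DEBUG=WARN); the runtime's actual launcher arguments are not shown in this PR.

```shell
# Hypothetical 2-node x 8-GPU launcher command (not the actual runtime args):
# only MPI/UCX transport settings plus LD_LIBRARY_PATH and NCCL_DEBUG are
# passed; NCCL tuning vars come from the sourced host profile instead.
mpirun -np 16 -N 8 \
  --mca btl tcp,self --mca btl_tcp_if_include eth0 \
  -x UCX_NET_DEVICES=eth0 \
  -x LD_LIBRARY_PATH -x NCCL_DEBUG=WARN \
  all_reduce_perf -b 8G -e 8G -g 1
```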

@xdu31 xdu31 requested a review from a team as a code owner March 18, 2026 05:11
@mchmarny mchmarny added this to the M2 - KubeCon EU milestone Mar 18, 2026
@yuanchen8911 (Contributor) left a comment

No blocking findings

Testing gaps:

  1. Runtime depends on host-provided nccl-env-profile.sh at execution time:

    NCCL_LIB_DIR=/usr/local/nvidia/lib64 . /usr/local/nvidia/lib64/nccl-env-profile.sh &&

  2. NIC switch to eth0 is environment/runtime-validated (not CI-asserted in this PR):


@xdu31 (Contributor, Author) commented Mar 18, 2026

> No blocking findings
>
> Testing gaps:
>
>   1. Runtime depends on host-provided nccl-env-profile.sh at execution time:
>
>     NCCL_LIB_DIR=/usr/local/nvidia/lib64 . /usr/local/nvidia/lib64/nccl-env-profile.sh &&
>
>   2. NIC switch to eth0 is environment/runtime-validated (not CI-asserted in this PR):

For the nccl-env-profile.sh dependency:
This is a deliberate design choice, not a gap. The whole point of this PR is to stop hardcoding NCCL vars and to source them from the host instead. The profile is installed by the nccl-tcpxo-installer DaemonSet; it is a platform prerequisite on every GKE GPU node, just like the GPUs or the TCPXO plugin itself. If the file is missing, the shell entrypoint fails fast with a clear error rather than silently running with wrong values.
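
The fail-fast behavior described here can be sketched as a small guard (the function name and message text are illustrative assumptions):

```shell
# Abort with a clear error instead of silently benchmarking with wrong
# NCCL values when the host profile is absent.
require_host_profile() {
  if [ ! -f "$1" ]; then
    echo "FATAL: expected host NCCL profile at $1 (is nccl-tcpxo-installer running?)" >&2
    return 1
  fi
}
```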

For NIC switch to eth0:
We unified on eth0 for consistency rather than keeping a mix; this is more of a cleanup than a behavior change. Both interfaces work, since MPI just needs at least one routable interface that isn't a GPU NIC.

@yuanchen8911 yuanchen8911 merged commit e15a3c6 into NVIDIA:main Mar 18, 2026
15 checks passed
xdu31 added a commit to xdu31/aicr that referenced this pull request Mar 24, 2026