fix(gke): update TCPXO to NRI profile without hostNetwork #420

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:docs/gke-tcpxo-hostnetwork
Mar 18, 2026

Contributor

@yuanchen8911 yuanchen8911 commented Mar 17, 2026

Summary

  • Update demos/workloads/training/gke-nccl-test-tcpxo.yaml to NRI profile with hostNetwork: false
  • Add docs/integrator/gke-tcpxo-networking.md documenting NRI profile, node pool prerequisites, and troubleshooting
  • Update demos/cuj1-gke.md with NRI profile as default

Root cause of previous 7/8 GPU issue

The GPU node pool was provisioned with an extra gVNIC additional network that occupied PCI address 0000:06:00.0 — one of the 8 GPU NIC slots. Removing the gVNIC from the node pool config resolves the issue. This was not a GKE version regression.
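
One way to look for this condition on a node is to list which PCI address and kernel driver back each network interface. The sketch below is illustrative and not part of this PR: it assumes the standard Linux sysfs layout (`/sys/class/net/<iface>/device`), the helper names `nic_pci_map` and `interfaces_at` are hypothetical, and it only covers interfaces whose `device` link resolves directly to a PCI device.

```python
import os


def nic_pci_map(sys_root="/sys"):
    """Map each network interface to (PCI address, driver name).

    Assumes the usual sysfs layout, where <sys_root>/class/net/<iface>/device
    is a symlink to the backing PCI device directory.
    """
    net_dir = os.path.join(sys_root, "class", "net")
    result = {}
    for iface in sorted(os.listdir(net_dir)):
        device = os.path.join(net_dir, iface, "device")
        if not os.path.exists(device):
            continue  # virtual interfaces (lo, veth, ...) have no backing device
        pci_addr = os.path.basename(os.path.realpath(device))
        driver_link = os.path.join(device, "driver")
        driver = None
        if os.path.exists(driver_link):
            driver = os.path.basename(os.path.realpath(driver_link))
        result[iface] = (pci_addr, driver)
    return result


def interfaces_at(pci_addr, sys_root="/sys"):
    """Return the interfaces backed by the given PCI address."""
    return [
        iface
        for iface, (addr, _driver) in nic_pci_map(sys_root).items()
        if addr == pci_addr
    ]
```

Calling `interfaces_at("0000:06:00.0")` on an affected node and inspecting the driver reported by `nic_pci_map()` for that interface should show whether the slot is held by a gVNIC interface (assumed here to use the `gve` driver) instead of a GPU NIC.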

NRI profile (no hostNetwork)

  • hostNetwork: false — preserves pod networking
  • privileged: false — tcpxo-daemon uses capabilities only
  • /sys mounted as /hostsysfs for PCI sysfs visibility
  • Validated on GKE 1.35: 338 GB/s AllReduce busBW (8/8 GPUs)
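
The first three constraints above can be checked mechanically against a pod spec. The following is a minimal sketch, assuming the spec has already been parsed into a Python dict using the standard Kubernetes field names (`hostNetwork`, `securityContext.privileged`, `volumes[].hostPath.path`, `volumeMounts[].mountPath`). The function name `check_nri_profile` is hypothetical and is not taken from the manifest in this PR.

```python
def check_nri_profile(pod_spec):
    """Return a list of problems; an empty list means the spec fits the profile."""
    problems = []

    # hostNetwork defaults to false when omitted from a pod spec.
    if pod_spec.get("hostNetwork", False):
        problems.append("hostNetwork must be false")

    volumes = {v["name"]: v for v in pod_spec.get("volumes", [])}
    sys_mounted = False
    for container in pod_spec.get("containers", []):
        security = container.get("securityContext") or {}
        if security.get("privileged", False):
            problems.append(f"container {container['name']} must not be privileged")
        for mount in container.get("volumeMounts", []):
            volume = volumes.get(mount["name"], {})
            host_path = (volume.get("hostPath") or {}).get("path")
            if mount.get("mountPath") == "/hostsysfs" and host_path == "/sys":
                sys_mounted = True

    if not sys_mounted:
        problems.append("no container mounts host /sys at /hostsysfs")
    return problems


# Illustrative spec only; the volume name "sys" is an assumption.
example_spec = {
    "hostNetwork": False,
    "volumes": [{"name": "sys", "hostPath": {"path": "/sys"}}],
    "containers": [
        {
            "name": "tcpxo-daemon",
            "securityContext": {"privileged": False},
            "volumeMounts": [{"name": "sys", "mountPath": "/hostsysfs"}],
        }
    ],
}
print(check_nri_profile(example_spec))
```

The check deliberately says nothing about which capabilities the tcpxo-daemon container needs, since the description above does not list them.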

Resolves #381

Test plan

  • NRI profile validated on GKE 1.35 (aicr-demo4) — 338 GB/s AllReduce, 8/8 GPUs
  • Confirmed root cause: gVNIC network on node pool displaces GPU NIC PCI slot
  • Verify markdown renders correctly on GitHub

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner March 17, 2026 18:54
@yuanchen8911 yuanchen8911 added the documentation and area/docs labels Mar 17, 2026
@yuanchen8911 yuanchen8911 force-pushed the docs/gke-tcpxo-hostnetwork branch from bdcd39c to 20f2b33 Compare March 17, 2026 20:27
@yuanchen8911 yuanchen8911 requested a review from mchmarny March 18, 2026 00:19
@yuanchen8911 yuanchen8911 force-pushed the docs/gke-tcpxo-hostnetwork branch from 20f2b33 to 28c77fe Compare March 18, 2026 02:07
@github-actions github-actions bot added size/XL and removed size/M labels Mar 18, 2026
@yuanchen8911 yuanchen8911 changed the title from "docs(gke): document hostNetwork requirement for TCPXO networking" to "docs(gke): document TCPXO networking profiles and NRI recommendation" Mar 18, 2026

@github-advanced-security github-advanced-security bot left a comment

Trivy found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@yuanchen8911 yuanchen8911 force-pushed the docs/gke-tcpxo-hostnetwork branch 2 times, most recently from a437389 to c65fcc0 Compare March 18, 2026 02:16
@yuanchen8911 yuanchen8911 changed the title from "docs(gke): document TCPXO networking profiles and NRI recommendation" to "fix(gke): update TCPXO test manifest to NRI profile with hostNetwork" Mar 18, 2026
@github-actions github-actions bot added size/L and removed size/XL labels Mar 18, 2026
@yuanchen8911 yuanchen8911 force-pushed the docs/gke-tcpxo-hostnetwork branch from c65fcc0 to 986515b Compare March 18, 2026 02:19
@mchmarny mchmarny added this to the M2 - KubeCon EU milestone Mar 18, 2026
@mchmarny mchmarny requested a review from atif1996 March 18, 2026 12:28
@yuanchen8911 yuanchen8911 force-pushed the docs/gke-tcpxo-hostnetwork branch from 4cd1d70 to a84930e Compare March 18, 2026 20:37
@yuanchen8911 yuanchen8911 changed the title from "fix(gke): update TCPXO test manifest to NRI profile with hostNetwork" to "fix(gke): update TCPXO to NRI profile without hostNetwork" Mar 18, 2026
Update TCPXO documentation and demo manifest to use the NRI profile
with hostNetwork: false. The previous 7/8 GPU issue was caused by a
gVNIC additional network on the GPU node pool taking a GPU NIC PCI slot,
not a GKE version regression.

- Update gke-nccl-test-tcpxo.yaml to hostNetwork: false (NRI profile)
  with RxDM v1.0.20 and NCCL plugin v1.0.14
- Document GPU node pool prerequisite: no gVNIC additional network
- Add troubleshooting for 7/8 and 0/8 GPU detection issues
- Update demos/cuj1-gke.md to remove hostNetwork fallback references
- Validated on GKE 1.35 with ~338 GB/s AllReduce busBW

Resolves NVIDIA#381

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@yuanchen8911 yuanchen8911 force-pushed the docs/gke-tcpxo-hostnetwork branch from a84930e to c852709 Compare March 18, 2026 20:46
Member

@mchmarny mchmarny left a comment

/lgtm

@yuanchen8911 yuanchen8911 merged commit f2ec6b2 into NVIDIA:main Mar 18, 2026
22 checks passed
xdu31 pushed a commit to xdu31/aicr that referenced this pull request Mar 24, 2026
Signed-off-by: Yuan Chen <yuanchen97@gmail.com>

Labels

area/docs, documentation, size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

docs(gke): hostNetwork requirement for TCPXO and non-privileged workaround

2 participants