fix(gke): update TCPXO to NRI profile without hostNetwork #420
Merged
yuanchen8911 merged 1 commit into NVIDIA:main on Mar 18, 2026
Trivy found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
Update TCPXO documentation and demo manifest to use the NRI profile with hostNetwork: false. The previous 7/8 GPU issue was caused by a gVNIC additional network on the GPU node pool taking a GPU NIC PCI slot, not a GKE version regression.

- Update gke-nccl-test-tcpxo.yaml to hostNetwork: false (NRI profile) with RxDM v1.0.20 and NCCL plugin v1.0.14
- Document GPU node pool prerequisite: no gVNIC additional network
- Add troubleshooting for 7/8 and 0/8 GPU detection issues
- Update demos/cuj1-gke.md to remove hostNetwork fallback references
- Validated on GKE 1.35 with ~338 GB/s AllReduce busBW

Resolves NVIDIA#381

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
xdu31 pushed a commit to xdu31/aicr that referenced this pull request on Mar 24, 2026
Summary
- Update `demos/workloads/training/gke-nccl-test-tcpxo.yaml` to NRI profile with `hostNetwork: false`
- Update `docs/integrator/gke-tcpxo-networking.md`, documenting NRI profile, node pool prerequisites, and troubleshooting
- Update `demos/cuj1-gke.md` with NRI profile as default

Root cause of previous 7/8 GPU issue
The GPU node pool was provisioned with an extra gVNIC additional network that occupied PCI address `0000:06:00.0`, one of the 8 GPU NIC slots. Removing the gVNIC from the node pool config resolves the issue. This was not a GKE version regression.

NRI profile (no hostNetwork)
- `hostNetwork: false`: preserves pod networking
- `privileged: false`: tcpxo-daemon uses capabilities only
- `/sys` mounted as `/hostsysfs` for PCI sysfs visibility

Resolves #381
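
For readers who have not opened the manifest, the three NRI profile settings above might map onto a pod spec roughly as follows. This is a minimal sketch and not the contents of `gke-nccl-test-tcpxo.yaml`: the pod name, volume name, and image reference are placeholders, and the capability list is an assumption, since the description only says the daemon "uses capabilities only".

```yaml
# Hypothetical sketch of the NRI profile settings; not the merged manifest.
apiVersion: v1
kind: Pod
metadata:
  name: tcpxo-nri-sketch                 # placeholder name
spec:
  hostNetwork: false                     # NRI profile: preserve pod networking
  containers:
    - name: tcpxo-daemon
      image: example.invalid/tcpxo-daemon:placeholder   # placeholder image reference
      securityContext:
        privileged: false                # capabilities only, no privileged mode
        capabilities:
          add:
            - NET_ADMIN                  # assumed capability set; not stated in this PR
            - NET_BIND_SERVICE           # assumed capability set; not stated in this PR
      volumeMounts:
        - name: host-sys                 # placeholder volume name
          mountPath: /hostsysfs          # host /sys exposed for PCI sysfs visibility
  volumes:
    - name: host-sys
      hostPath:
        path: /sys
```

The authoritative values (images, versions, capabilities, and any additional mounts) are in the manifest changed by this PR.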
Test plan