Summary
On GKE, the TCPXO daemon (fastrak_gpumem_manager, v1.0.21) fails to start when hostNetwork: false, even with NRI device injection and privileged mode: it detects fewer GPUs in PCI sysfs than CUDA reports and exits.
Environment
- GKE: v1.35.0-gke.3047002
- Node type: a3-megagpu-8g (8x H100)
- OS image: Container-Optimized OS (COS), kernel 6.12.55+
- TCPXO daemon image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.21
- NRI injector deployed and injecting the expected device nodes
- Reproduced on two independent GKE clusters
Expected behavior
With NRI annotations providing /dev/nvidia* device access, the TCPXO daemon should work without requiring hostNetwork: true.
Observed behavior
Systematic testing of all configuration combinations:
| hostNetwork | privileged | NRI | GPUs in PCI tree | Works? |
|---|---|---|---|---|
| false | true | no | 0/8 | No |
| false | true | yes | 7/8 | No |
| false | false | yes | 7/8 | No |
| true | false | no | no CUDA devices | No |
| true | false | yes | 8/8 | Yes |
| true | true | no | 8/8 | Yes |
Key observations:
- hostNetwork: true is required for full PCI sysfs visibility (8/8 GPUs)
- Without hostNetwork, NRI provides /dev/nvidia* but the PCI tree shows only 7/8 GPUs
- Without hostNetwork and without NRI, the PCI tree shows 0/8 GPUs
- privileged and NRI are interchangeable for GPU device access, but neither fixes PCI visibility without hostNetwork
Daemon error log
E0312 fastrak_gpumem_manager.cc:200] Number of GPUs detected in the PCI tree 7 is not equal to the actual number of GPUs reported by CUDA 8.
E0312 fastrak_gpumem_manager_startup.cc:45] Exiting with result:1
Suspected root cause
PCI/sysfs visibility is tied to the network namespace. When hostNetwork=false, the container gets an isolated network namespace which restricts PCI sysfs enumeration — one GPU is missing from the PCI tree regardless of device access or privilege level. hostNetwork: true shares the host network namespace, restoring full PCI sysfs visibility.
NRI device injection correctly provides /dev/nvidia* access (CUDA sees 8 GPUs) but does not affect PCI sysfs visibility (daemon sees 7).
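To illustrate the mismatch the daemon is checking, here is a minimal diagnostic sketch that counts NVIDIA GPUs visible in PCI sysfs versus /dev/nvidia* device nodes (a rough proxy for what CUDA can open). It assumes a standard sysfs layout and uses NVIDIA's PCI vendor ID 0x10de; it is not the daemon's actual logic, just a way to reproduce the 7-vs-8 discrepancy from inside a pod.

```shell
# Count NVIDIA GPUs (VGA or 3D controller class) in the PCI tree.
pci_gpus=0
for dev in /sys/bus/pci/devices/*; do
  [ -f "$dev/vendor" ] || continue
  vendor=$(cat "$dev/vendor")
  class=$(cat "$dev/class")
  # Vendor 0x10de is NVIDIA; class 0x0300xx is VGA, 0x0302xx is 3D controller.
  if [ "$vendor" = "0x10de" ] && \
     { [ "${class#0x0300}" != "$class" ] || [ "${class#0x0302}" != "$class" ]; }; then
    pci_gpus=$((pci_gpus + 1))
  fi
done

# Count injected /dev/nvidia* device nodes (proxy for CUDA-visible GPUs).
cuda_gpus=$(ls /dev/nvidia[0-9]* 2>/dev/null | wc -l)

echo "GPUs in PCI tree: $pci_gpus, /dev/nvidia* nodes: $cuda_gpus"
```

Running this in a pod with hostNetwork: false versus hostNetwork: true should show the PCI count change (7 vs 8 here) while the device-node count stays at 8 whenever NRI injection is active.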
Workaround
Minimal secure configuration:
- hostNetwork: true
- NRI annotations for GPU device injection
- privileged: false (not required)
Full TCPXO bandwidth achieved: 335 GB/s peak busBW (AllReduce, 8 GB message size, 2 nodes × 8 H100).
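A sketch of the working pod spec under the configuration above. The pod name is hypothetical, and the NRI annotation keys are injector-specific, so they are left as a placeholder comment rather than invented values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tcpxo-daemon-example        # hypothetical name
  annotations:
    # NRI device-injection annotations go here (keys are injector-specific)
spec:
  hostNetwork: true                 # required for full PCI sysfs visibility (8/8 GPUs)
  containers:
  - name: tcpxo-daemon
    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.21
    securityContext:
      privileged: false             # not required once NRI injects /dev/nvidia*
```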
Impact
TCPXO cannot run with fully isolated pod networking (hostNetwork: false); all TCPXO workloads currently require hostNetwork: true.
Questions
- Is hostNetwork: true now a required condition for TCPXO on GKE 1.35 + daemon v1.0.21, or is this a regression from earlier versions?
- Is the PCI sysfs filtering (7/8 GPUs visible without hostNetwork) expected behavior in the container runtime, or a bug?
- Could the daemon be made tolerant of a PCI/CUDA GPU count mismatch when the missing GPU is still functional via CUDA?
We can provide full manifests, kubectl describe pod, and daemon logs as a tarball if needed.