GPUDirect-TCPXO daemon sees incomplete PCI GPU inventory without hostNetwork on GKE A3 Mega #580

@yuanchen8911

Description

Summary

On GKE, the TCPXO daemon (fastrak_gpumem_manager, v1.0.21) fails to start when the pod runs with hostNetwork: false, even with NRI device injection and privileged mode. The daemon detects fewer GPUs in the PCI sysfs tree than CUDA reports and exits.

Environment

  • GKE: v1.35.0-gke.3047002
  • Node type: a3-megagpu-8g (8x H100)
  • OS image: Container-Optimized OS (COS), kernel 6.12.55+
  • TCPXO daemon image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.21
  • NRI injector deployed and injecting expected device nodes
  • Reproduced on two independent GKE clusters

Expected behavior

With NRI annotations providing /dev/nvidia* device access, TCPXO daemon should work without requiring hostNetwork: true.

Observed behavior

We systematically tested all combinations of hostNetwork, privileged, and NRI device injection:

| hostNetwork | privileged | NRI | PCI GPUs | Works? |
|---|---|---|---|---|
| false | true | no | 0/8 | No |
| false | true | yes | 7/8 | No |
| false | false | yes | 7/8 | No |
| true | false | no | no CUDA devices | No |
| true | false | yes | 8/8 | Yes |
| true | true | no | 8/8 | Yes |

Key observations:

  • hostNetwork: true is required for full PCI sysfs visibility (8/8 GPUs)
  • Without hostNetwork, NRI provides /dev/nvidia* but PCI tree shows only 7/8 GPUs
  • Without hostNetwork and without NRI, PCI tree shows 0/8 GPUs
  • privileged: true and NRI injection are interchangeable for exposing the GPU device nodes, but neither restores PCI visibility without hostNetwork

Daemon error log

```
E0312 fastrak_gpumem_manager.cc:200] Number of GPUs detected in the PCI tree 7 is not equal to the actual number of GPUs reported by CUDA 8.
E0312 fastrak_gpumem_manager_startup.cc:45] Exiting with result:1
```

Suspected root cause

PCI sysfs visibility appears to be tied to the network namespace. With hostNetwork: false the container gets an isolated network namespace, and its sysfs view of the PCI tree is restricted: one GPU is missing regardless of device access or privilege level. hostNetwork: true shares the host's network namespace, restoring full PCI sysfs visibility.

NRI device injection correctly provides /dev/nvidia* access (CUDA sees 8 GPUs) but does not affect PCI sysfs visibility (daemon sees 7).

Workaround

Minimal secure configuration:

  • hostNetwork: true
  • NRI annotations for GPU device injection
  • privileged: false (not required)
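A pod-spec fragment for this configuration might look like the following (a sketch: the annotation key and value are placeholders, since they depend on how the NRI injector in a given cluster is configured):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tcpxo-daemon
  annotations:
    # Placeholder: use whatever annotation your NRI injector matches on
    # to inject the /dev/nvidia* device nodes.
    example.nri-injector/devices: "nvidia-gpus"
spec:
  hostNetwork: true        # required: restores full PCI sysfs visibility (8/8)
  containers:
  - name: tcpxo-daemon
    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.21
    securityContext:
      privileged: false    # not required once NRI injects the device nodes
```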

With this configuration we achieved full TCPXO bandwidth: 335 GB/s peak busBW (AllReduce, 8 GB message size, 2 nodes x 8 H100).

Impact

TCPXO cannot run with fully isolated pod networking (hostNetwork: false); all TCPXO workloads must set hostNetwork: true.

Questions

  1. Is hostNetwork: true now a required condition for TCPXO on GKE 1.35 + daemon v1.0.21, or is this a regression from earlier versions?
  2. Is the PCI sysfs filtering (7/8 GPUs visible without hostNetwork) expected behavior in the container runtime, or a bug?
  3. Could the daemon be made tolerant of PCI/CUDA GPU count mismatch when the missing GPU is still functional via CUDA?

We can provide full manifests, kubectl describe pod, and daemon logs as a tarball if needed.
