docs(gke): hostNetwork requirement for TCPXO and non-privileged workaround #381

## Summary

GKE GPUDirect-TCPXO requires `hostNetwork: true` for the TCPXO daemon sidecar to enumerate all GPUs via PCI sysfs. Without it, the daemon detects fewer GPUs than CUDA reports and exits. This is a GKE container runtime limitation, not an AICR issue.

`privileged: true` is **not required** when using NRI device injection. Capabilities are also not required.
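
The PCI sysfs visibility described above can be checked from inside a container with a short script. This is a minimal diagnostic sketch, not the enumeration logic of the TCPXO daemon itself: it assumes the standard Linux sysfs layout (`/sys/bus/pci/devices/<address>/vendor` and `class`) and counts devices with NVIDIA's PCI vendor ID and a display-controller class code. The function name `count_pci_gpus` is hypothetical.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA
# PCI class prefixes: 0x0300 = VGA controller, 0x0302 = 3D controller
GPU_CLASS_PREFIXES = ("0x0300", "0x0302")


def count_pci_gpus(sysfs_root: str = "/sys/bus/pci/devices") -> int:
    """Count NVIDIA GPU devices visible under a PCI sysfs directory."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return 0
    count = 0
    for dev in root.iterdir():
        try:
            vendor = (dev / "vendor").read_text().strip().lower()
            pci_class = (dev / "class").read_text().strip().lower()
        except OSError:
            continue  # skip entries without readable vendor/class files
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith(GPU_CLASS_PREFIXES):
            count += 1
    return count


if __name__ == "__main__":
    print(f"NVIDIA GPUs visible in PCI sysfs: {count_pci_gpus()}")
```

On an a3-megagpu-8g node, a count below 8 from inside the daemon container would be consistent with the failure mode described in this issue.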

## Configuration Matrix

Results from systematic testing on two independent GKE clusters (v1.35, a3-megagpu-8g, COS):

| hostNetwork | privileged | NRI | PCI GPUs | Works? |
|-------------|-----------|-----|----------|--------|
| false | true | no | 0/8 | **No** |
| false | true | yes | 7/8 | **No** |
| false | false | yes | 7/8 | **No** |
| true | false | no | no CUDA devices | **No** |
| **true** | **false** | **yes** | **8/8** | **Yes** |
| **true** | **true** | **no** | **8/8** | **Yes** |

Key findings:
- `hostNetwork: true` is **required** for full PCI sysfs visibility (8/8 GPUs)
- `privileged` and NRI are interchangeable for GPU device access: either one is sufficient, but with neither the container sees no CUDA devices
- Capabilities (`NET_ADMIN`, `NET_BIND_SERVICE`) are **not required**; NRI device injection alone is sufficient
- Without `hostNetwork`, PCI tree shows 7/8 GPUs (with NRI) or 0/8 (without NRI)
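
The matrix and key findings above can be encoded as a small lookup plus a one-line rule, which may be handy for a pre-flight check in tooling. This is a sketch only: the names `TESTED_MATRIX`, `lookup`, and `predicted_to_work` are hypothetical, and the rule is an inference from the six tested rows, so combinations that were not tested (for example all three settings enabled) are predictions rather than results.

```python
from typing import NamedTuple, Optional


class Outcome(NamedTuple):
    gpu_visibility: str  # value from the "PCI GPUs" column
    works: bool


# Keys are (hostNetwork, privileged, NRI); only tested combinations are listed.
TESTED_MATRIX = {
    (False, True, False): Outcome("0/8", False),
    (False, True, True): Outcome("7/8", False),
    (False, False, True): Outcome("7/8", False),
    (True, False, False): Outcome("no CUDA devices", False),
    (True, False, True): Outcome("8/8", True),
    (True, True, False): Outcome("8/8", True),
}


def lookup(host_network: bool, privileged: bool, nri: bool) -> Optional[Outcome]:
    """Return the tested outcome, or None if the combination was not tested."""
    return TESTED_MATRIX.get((host_network, privileged, nri))


def predicted_to_work(host_network: bool, privileged: bool, nri: bool) -> bool:
    """Rule implied by the findings: hostNetwork plus one device-access path."""
    return host_network and (privileged or nri)


if __name__ == "__main__":
    # The inferred rule agrees with every tested row.
    for combo, outcome in TESTED_MATRIX.items():
        assert predicted_to_work(*combo) == outcome.works
    print(lookup(True, False, True))
```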

## Validated TCPXO Runtime Profiles

**Minimal secure (recommended):**
- `hostNetwork: true`
- `privileged: false`
- NRI annotations: `devices.gke.io/container.tcpxo-daemon` + `networking.gke.io/interfaces`
- No capabilities needed
- Requires NRI device injector DaemonSet (included in AICR bundle)
- **Result: 335 GB/s peak busBW, 87.2 GB/s avg**

**Fallback (privileged):**
- `hostNetwork: true` + `privileged: true`
- No NRI annotations needed
- **Result: 335 GB/s peak busBW, 87.2 GB/s avg**
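
The two profiles differ only in the `privileged` flag and the NRI annotations, which the following sketch makes explicit by generating a partial Pod manifest as a Python dict. Several details are assumptions rather than facts from this issue: the container name `tcpxo-daemon` is inferred from the annotation key, the annotation value formats (a YAML list of `path` entries and a JSON list of interface/network pairs) are assumed, and the device paths and network names in the usage example are placeholders that must be replaced with cluster-specific values. The function name `build_tcpxo_pod_fragment` is hypothetical.

```python
import json


def build_tcpxo_pod_fragment(profile, device_paths=None, interfaces=None):
    """Return a partial Pod manifest (as a dict) for one of the two profiles.

    profile: "minimal-secure" (NRI, non-privileged) or "fallback" (privileged).
    device_paths: device nodes for NRI to inject (assumed value format).
    interfaces: (interfaceName, network) pairs; network names are cluster-specific.
    """
    if profile == "minimal-secure":
        if not device_paths or not interfaces:
            raise ValueError("minimal-secure needs device_paths and interfaces")
        annotations = {
            # Assumed format: one YAML list item per device path.
            "devices.gke.io/container.tcpxo-daemon": "".join(
                f"- path: {path}\n" for path in device_paths
            ),
            # Assumed format: JSON list of interface-to-network mappings.
            "networking.gke.io/interfaces": json.dumps(
                [{"interfaceName": name, "network": net} for name, net in interfaces]
            ),
        }
        privileged = False
    elif profile == "fallback":
        annotations = {}  # no NRI annotations needed
        privileged = True
    else:
        raise ValueError(f"unknown profile: {profile!r}")

    return {
        "metadata": {"annotations": annotations},
        "spec": {
            "hostNetwork": True,  # required by both profiles
            "containers": [
                # Container name is an assumption taken from the annotation key.
                {"name": "tcpxo-daemon", "securityContext": {"privileged": privileged}}
            ],
        },
    }


if __name__ == "__main__":
    fragment = build_tcpxo_pod_fragment(
        "minimal-secure",
        device_paths=["/dev/nvidia0", "/dev/nvidiactl"],  # placeholder paths
        interfaces=[("eth1", "vpc1")],  # placeholder network name
    )
    print(json.dumps(fragment, indent=2))
```

The fragment omits images, volumes, and resource requests; it is meant to be merged into a full workload template, not applied as is.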

## AICR Changes

PR #383 (`feat/gke-cos-training-overlays`) includes:
- NRI device injector in `gke-nccl-tcpxo` component
- TCPXO runtime requirements documented in `demos/cuj1-gke.md`
- NCCL test uses fallback profile (privileged) for broad compatibility
- GKE NCCL performance test in `pendingNCCLCombinations` (validator automation needs raw Pods + exec strategy)

## Follow-up

- [ ] Update NCCL test to minimal secure profile (NRI, non-privileged) once validated in production
- [ ] Automate GKE NCCL performance validation (requires multi-resource apply + pod exec strategy in validator)
- [ ] Consider auto-generating NRI annotations in workload templates (network names are cluster-specific)
- [ ] Track upstream issue resolution: https://github.com/GoogleCloudPlatform/container-engine-accelerators/issues/580

