Closed
Labels: area/recipes, documentation
Summary
GKE GPUDirect-TCPXO requires `hostNetwork: true` for the TCPXO daemon sidecar to enumerate all GPUs via PCI sysfs. Without it, the daemon detects fewer GPUs than CUDA reports and exits. This is a GKE container runtime limitation, not an AICR issue.
`privileged: true` is not required when using NRI device injection. Capabilities are also not required.
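
To make the "fewer GPUs than CUDA reports" symptom concrete, here is a minimal sketch of counting GPUs through PCI sysfs from inside a container. It is not the TCPXO daemon's actual logic, which this issue does not show. The function names and the matching rule (NVIDIA vendor ID `0x10de` plus PCI display-controller base class `0x03`) are assumptions chosen for illustration.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"    # PCI vendor ID assigned to NVIDIA
DISPLAY_CLASS_PREFIX = "0x03"  # PCI base class 0x03: display controller


def count_pci_gpus(sysfs_root="/sys/bus/pci/devices"):
    """Count NVIDIA display-class PCI functions visible under sysfs_root."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return 0
    count = 0
    for dev in root.iterdir():
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
        except OSError:
            continue  # entry without readable vendor/class files
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith(DISPLAY_CLASS_PREFIX):
            count += 1
    return count


def check_gpu_inventory(expected, sysfs_root="/sys/bus/pci/devices"):
    """Return (found, matches) so a caller can compare against the CUDA count."""
    found = count_pci_gpus(sysfs_root)
    return found, found == expected
```

On an a3-megagpu-8g node, `check_gpu_inventory(8)` run inside the sidecar container would be one way to reproduce the 7/8 versus 8/8 difference reported in this issue, assuming the daemon's view matches this simple sysfs walk.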
Configuration Matrix
Systematic testing on two independent GKE clusters (v1.35, a3-megagpu-8g, COS):
| hostNetwork | privileged | NRI | PCI GPUs | Works? |
|---|---|---|---|---|
| false | true | no | 0/8 | No |
| false | true | yes | 7/8 | No |
| false | false | yes | 7/8 | No |
| true | false | no | no CUDA devices | No |
| true | false | yes | 8/8 | Yes |
| true | true | no | 8/8 | Yes |
Key findings:
- `hostNetwork: true` is required for full PCI sysfs visibility (8/8 GPUs)
- `privileged` and NRI are interchangeable for GPU device access
- Capabilities (`NET_ADMIN`, `NET_BIND_SERVICE`) are not required; NRI device injection alone is sufficient
- Without `hostNetwork`, PCI tree shows 7/8 GPUs (with NRI) or 0/8 (without NRI)
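
The matrix can also be kept as data, so that tooling can look up a configuration instead of re-deriving the rule. The sketch below copies the six tested rows verbatim. The names `Result`, `TESTED`, `lookup`, and `working_profiles` are hypothetical, and the two combinations that were not tested deliberately return `None` rather than a guess.

```python
from typing import NamedTuple, Optional


class Result(NamedTuple):
    pci_gpus: str  # value from the "PCI GPUs" column
    works: bool    # value from the "Works?" column


# Keys are (hostNetwork, privileged, nri); rows copied from the table above.
TESTED = {
    (False, True, False): Result("0/8", False),
    (False, True, True): Result("7/8", False),
    (False, False, True): Result("7/8", False),
    (True, False, False): Result("no CUDA devices", False),
    (True, False, True): Result("8/8", True),
    (True, True, False): Result("8/8", True),
}


def lookup(host_network: bool, privileged: bool, nri: bool) -> Optional[Result]:
    """Return the tested result, or None if this combination was not tested."""
    return TESTED.get((host_network, privileged, nri))


def working_profiles():
    """List the (hostNetwork, privileged, nri) combinations that worked."""
    return sorted(key for key, result in TESTED.items() if result.works)
```

Every entry returned by `working_profiles()` has `hostNetwork` set to true, which matches the first key finding.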
Validated TCPXO Runtime Profiles
Minimal secure (recommended):
- `hostNetwork: true`
- `privileged: false`
- NRI annotations: `devices.gke.io/container.tcpxo-daemon` + `networking.gke.io/interfaces`
- No capabilities needed
- Requires NRI device injector DaemonSet (included in AICR bundle)
- Result: 335 GB/s peak busBW, 87.2 GB/s avg
Fallback (privileged):
- `hostNetwork: true` + `privileged: true`
- No NRI annotations needed
- Result: 335 GB/s peak busBW, 87.2 GB/s avg
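
As a rough illustration of how the two profiles differ in a manifest, the sketch below builds only the relevant Pod fragment as a Python dict. It is not taken from the AICR bundle. The function name, the profile labels, and the container name `tcpxo-daemon` (inferred from the annotation key) are assumptions, and the annotation values must be supplied by the caller because they are cluster-specific.

```python
REQUIRED_NRI_KEYS = (
    "devices.gke.io/container.tcpxo-daemon",
    "networking.gke.io/interfaces",
)


def tcpxo_pod_fragment(profile, nri_annotations=None):
    """Build the Pod fields that distinguish the two validated profiles."""
    if profile == "minimal-secure":
        annotations = dict(nri_annotations or {})
        missing = [k for k in REQUIRED_NRI_KEYS if k not in annotations]
        if missing:
            raise ValueError(f"missing NRI annotations: {missing}")
        privileged = False
    elif profile == "fallback-privileged":
        annotations = {}  # no NRI annotations needed in this profile
        privileged = True
    else:
        raise ValueError(f"unknown profile: {profile!r}")
    return {
        "metadata": {"annotations": annotations},
        "spec": {
            "hostNetwork": True,  # required in both profiles
            "containers": [
                {
                    # Container name assumed from the annotation key.
                    "name": "tcpxo-daemon",
                    "securityContext": {"privileged": privileged},
                }
            ],
        },
    }
```

The result is a fragment, not a complete Pod: image, volumes, and the workload container are omitted and would be merged in from the real template.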
AICR Changes
PR #383 (feat/gke-cos-training-overlays) includes:
- NRI device injector in `gke-nccl-tcpxo` component
- TCPXO runtime requirements documented in `demos/cuj1-gke.md`
- NCCL test uses fallback profile (privileged) for broad compatibility
- GKE NCCL performance test in `pendingNCCLCombinations` (validator automation needs raw Pods + exec strategy)
Follow-up
- Update NCCL test to minimal secure profile (NRI, non-privileged) once validated in production
- Automate GKE NCCL performance validation (requires multi-resource apply + pod exec strategy in validator)
- Consider auto-generating NRI annotations in workload templates (network names are cluster-specific)
- Track upstream issue resolution: GoogleCloudPlatform/container-engine-accelerators#580 (GPUDirect-TCPXO daemon sees incomplete PCI GPU inventory without hostNetwork on GKE A3 Mega)
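
For the annotation auto-generation item, a generator could look roughly like the sketch below. The value formats are assumptions, not something this issue specifies: a YAML-style list of `path:` entries for the device annotation, and a JSON list of `interfaceName`/`network` pairs for the interfaces annotation. The device paths, the `ethN` interface naming, and the function name are likewise illustrative and should be checked against the GKE documentation and the NRI device injector before use.

```python
import json


def generate_nri_annotations(network_names, gpu_count=8, container="tcpxo-daemon"):
    """Sketch: derive the two NRI annotations from cluster-specific networks.

    network_names is ordered; index i is mapped to interface eth<i>
    (an assumed naming scheme).
    """
    # Assumed device list: one node per GPU plus the NVIDIA control nodes.
    device_paths = [f"/dev/nvidia{i}" for i in range(gpu_count)]
    device_paths += ["/dev/nvidiactl", "/dev/nvidia-uvm"]
    devices_value = "".join(f"- path: {p}\n" for p in device_paths)

    interfaces = [
        {"interfaceName": f"eth{i}", "network": name}
        for i, name in enumerate(network_names)
    ]
    return {
        f"devices.gke.io/container.{container}": devices_value,
        "networking.gke.io/interfaces": json.dumps(interfaces),
    }
```

Taking the network names as input keeps the cluster-specific part out of the template, which is the concern raised in the item above.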