docs(gke): hostNetwork requirement for TCPXO and non-privileged workaround #381

@yuanchen8911

Summary

GKE GPUDirect-TCPXO requires hostNetwork: true for the TCPXO daemon sidecar to enumerate all GPUs via PCI sysfs. Without it, the daemon detects fewer GPUs than CUDA reports and exits. This is a GKE container runtime limitation, not an AICR issue.

privileged: true is not required when using NRI device injection. Capabilities are also not required.
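To see the symptom from inside a container, it can help to count the GPUs visible in the PCI sysfs tree and compare that with the number CUDA reports. The sketch below is a diagnostic aid only, not the daemon's actual detection logic. It assumes GPUs appear under `/sys/bus/pci/devices` with the NVIDIA vendor ID `0x10de` and a VGA or 3D controller class code; the function names are hypothetical.

```python
import os

NVIDIA_VENDOR_ID = "0x10de"
# PCI class codes: 0x0300xx = VGA controller, 0x0302xx = 3D controller.
GPU_CLASS_PREFIXES = ("0x0300", "0x0302")


def _read(path):
    """Return the stripped, lowercased file contents, or "" if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip().lower()
    except OSError:
        return ""


def count_pci_gpus(sysfs_root="/sys/bus/pci/devices"):
    """Count NVIDIA GPU functions visible under the given PCI sysfs directory."""
    try:
        entries = os.listdir(sysfs_root)
    except OSError:
        return 0
    count = 0
    for dev in entries:
        vendor = _read(os.path.join(sysfs_root, dev, "vendor"))
        pci_class = _read(os.path.join(sysfs_root, dev, "class"))
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith(GPU_CLASS_PREFIXES):
            count += 1
    return count


def pci_matches_expected(expected, sysfs_root="/sys/bus/pci/devices"):
    """Return (found, ok), where ok is True when at least `expected` GPUs are visible."""
    found = count_pci_gpus(sysfs_root)
    return found, found >= expected
```

On an a3-megagpu-8g node, `pci_matches_expected(8)` run inside the daemon container would be expected to report fewer than 8 GPUs when hostNetwork is false, in line with the matrix below.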

Configuration Matrix

Systematic testing on two independent GKE clusters (v1.35, a3-megagpu-8g, COS):

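| hostNetwork | privileged | NRI | PCI GPUs        | Works? |
|-------------|------------|-----|-----------------|--------|
| false       | true       | no  | 0/8             | No     |
| false       | true       | yes | 7/8             | No     |
| false       | false      | yes | 7/8             | No     |
| true        | false      | no  | no CUDA devices | No     |
| true        | false      | yes | 8/8             | Yes    |
| true        | true       | no  | 8/8             | Yes    |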

Key findings:

  • hostNetwork: true is required for full PCI sysfs visibility (8/8 GPUs)
  • With hostNetwork: true, privileged and NRI are interchangeable for GPU device access; at least one of them is needed
  • Capabilities (NET_ADMIN, NET_BIND_SERVICE) are not required; NRI device injection alone is sufficient
  • Without hostNetwork, the PCI tree shows 7/8 GPUs (with NRI) or 0/8 (without NRI)
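The findings above reduce to a single rule: a configuration works when hostNetwork is true and at least one of privileged or NRI is enabled. The sketch below encodes the six tested rows and checks that rule against them. The `Run` type and function names are hypothetical, and the rule is only verified for the combinations that were actually tested.

```python
from typing import NamedTuple, Optional


class Run(NamedTuple):
    host_network: bool
    privileged: bool
    nri: bool
    pci_gpus: Optional[int]  # None where the matrix reports "no CUDA devices"
    works: bool


# The six rows of the configuration matrix, in the same order.
MATRIX = [
    Run(False, True, False, 0, False),
    Run(False, True, True, 7, False),
    Run(False, False, True, 7, False),
    Run(True, False, False, None, False),
    Run(True, False, True, 8, True),
    Run(True, True, False, 8, True),
]


def predicted_to_work(host_network, privileged, nri):
    """Rule distilled from the findings: hostNetwork plus (privileged or NRI)."""
    return host_network and (privileged or nri)


def mismatches(matrix=MATRIX):
    """Return the rows whose observed result disagrees with the rule."""
    return [
        r for r in matrix
        if predicted_to_work(r.host_network, r.privileged, r.nri) != r.works
    ]
```

`mismatches()` returns an empty list for the rows above. Combinations not in the matrix, such as privileged and NRI both enabled, remain untested.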

Validated TCPXO Runtime Profiles

Minimal secure (recommended):

  • hostNetwork: true
  • privileged: false
  • NRI annotations: devices.gke.io/container.tcpxo-daemon + networking.gke.io/interfaces
  • No capabilities needed
  • Requires NRI device injector DaemonSet (included in AICR bundle)
  • Result: 335 GB/s peak busBW, 87.2 GB/s avg
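As an illustration of how this profile could map onto a Pod manifest, the sketch below builds one as a Python dict. Only the two annotation keys, hostNetwork, and privileged come from this issue. The annotation values are not given here, so they are passed in by the caller, and the container name `tcpxo-daemon` is an assumption inferred from the suffix of the device annotation key. The function name and image argument are hypothetical.

```python
import json


def minimal_secure_pod(name, daemon_image, device_annotation, interfaces_annotation):
    """Build a Pod manifest dict for the minimal secure TCPXO profile.

    The annotation values are supplied by the caller because this issue
    does not specify them; the container name is an assumption.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "annotations": {
                "devices.gke.io/container.tcpxo-daemon": device_annotation,
                "networking.gke.io/interfaces": interfaces_annotation,
            },
        },
        "spec": {
            # Required for full PCI sysfs visibility (8/8 GPUs).
            "hostNetwork": True,
            "containers": [
                {
                    "name": "tcpxo-daemon",  # assumed container name
                    "image": daemon_image,
                    # Non-privileged, and no capabilities are added.
                    "securityContext": {"privileged": False},
                }
            ],
        },
    }


def to_json(pod):
    """Serialize the manifest to JSON."""
    return json.dumps(pod, indent=2)
```

The dict only describes the daemon container; a real workload Pod would also carry the application container and whatever else the workload needs.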

Fallback (privileged):

  • hostNetwork: true + privileged: true
  • No NRI annotations needed
  • Result: 335 GB/s peak busBW, 87.2 GB/s avg
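To check which of the two validated profiles a manifest corresponds to, a small classifier along these lines could be used. It is a sketch: the function name, the return labels, and the default container name `tcpxo-daemon` are assumptions, and it only inspects the manifest, so it cannot confirm that the NRI device injector DaemonSet is actually running on the node.

```python
NRI_ANNOTATIONS = (
    "devices.gke.io/container.tcpxo-daemon",
    "networking.gke.io/interfaces",
)


def classify_tcpxo_profile(pod, daemon_container="tcpxo-daemon"):
    """Return "minimal-secure", "fallback-privileged", or None.

    `pod` is a Pod manifest as a dict. None means the manifest matches
    neither validated profile.
    """
    spec = pod.get("spec") or {}
    # Both validated profiles require hostNetwork: true.
    if spec.get("hostNetwork") is not True:
        return None

    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    has_nri = all(key in annotations for key in NRI_ANNOTATIONS)

    for container in spec.get("containers") or []:
        if container.get("name") != daemon_container:
            continue
        security = container.get("securityContext") or {}
        if security.get("privileged") is True:
            # A privileged daemon is reported as the fallback profile,
            # whether or not NRI annotations are also present.
            return "fallback-privileged"
        return "minimal-secure" if has_nri else None
    return None
```

A manifest with hostNetwork: true, a non-privileged daemon container, and both annotations is classified as "minimal-secure"; dropping hostNetwork makes the result None in every case.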

AICR Changes

PR #383 (feat/gke-cos-training-overlays) includes:

  • NRI device injector in gke-nccl-tcpxo component
  • TCPXO runtime requirements documented in demos/cuj1-gke.md
  • NCCL test uses fallback profile (privileged) for broad compatibility
  • GKE NCCL performance test in pendingNCCLCombinations (validator automation needs raw Pods + exec strategy)

Follow-up
