GPUDirect-TCPXO daemon sees incomplete PCI GPU inventory without hostNetwork on GKE A3 Mega #580

@yuanchen8911

Description

Summary

On GKE, the TCPXO daemon (fastrak_gpumem_manager, v1.0.21) fails to start when the pod runs with hostNetwork: false, even with NRI device injection and privileged mode. The daemon detects fewer GPUs in the PCI sysfs tree than CUDA reports and exits.

Environment

  • GKE: v1.35.0-gke.3047002
  • Node type: a3-megagpu-8g (8x H100)
  • OS image: Container-Optimized OS (COS), kernel 6.12.55+
  • TCPXO daemon image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.21
  • NRI injector deployed and injecting expected device nodes
  • Reproduced on two independent GKE clusters

Expected behavior

With NRI annotations providing /dev/nvidia* device access, TCPXO daemon should work without requiring hostNetwork: true.

Observed behavior

We systematically tested all combinations of hostNetwork, privileged, and NRI device injection:

| hostNetwork | privileged | NRI | PCI GPUs | Works? |
|---|---|---|---|---|
| false | true | no | 0/8 | No |
| false | true | yes | 7/8 | No |
| false | false | yes | 7/8 | No |
| true | false | no | no CUDA devices | No |
| true | false | yes | 8/8 | Yes |
| true | true | no | 8/8 | Yes |

Key observations:

  • hostNetwork: true is required for full PCI sysfs visibility (8/8 GPUs)
  • Without hostNetwork, NRI provides /dev/nvidia* but PCI tree shows only 7/8 GPUs
  • Without hostNetwork and without NRI, PCI tree shows 0/8 GPUs
  • privileged: true and NRI injection are interchangeable for exposing the GPU device nodes, but neither restores PCI visibility without hostNetwork

Daemon error log

```
E0312 fastrak_gpumem_manager.cc:200] Number of GPUs detected in the PCI tree 7 is not equal to the actual number of GPUs reported by CUDA 8.
E0312 fastrak_gpumem_manager_startup.cc:45] Exiting with result:1
```

Suspected root cause

PCI sysfs visibility appears to be tied to the network namespace. With hostNetwork: false the container gets an isolated network namespace, and its sysfs view of the PCI tree is restricted: one GPU is missing regardless of device access or privilege level. hostNetwork: true shares the host's network namespace, restoring full PCI sysfs visibility.

NRI device injection correctly provides /dev/nvidia* access (CUDA sees 8 GPUs) but does not affect PCI sysfs visibility (daemon sees 7).

Workaround

Minimal secure configuration:

  • hostNetwork: true
  • NRI annotations for GPU device injection
  • privileged: false (not required)
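A pod-spec fragment for this configuration might look like the following (a sketch: the annotation key and value are placeholders, since they depend on how the NRI injector in a given cluster is configured):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tcpxo-daemon
  annotations:
    # Placeholder: use whatever annotation your NRI injector matches on
    # to inject the /dev/nvidia* device nodes.
    example.nri-injector/devices: "nvidia-gpus"
spec:
  hostNetwork: true        # required: restores full PCI sysfs visibility (8/8)
  containers:
  - name: tcpxo-daemon
    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.21
    securityContext:
      privileged: false    # not required once NRI injects the device nodes
```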

With this configuration we achieved full TCPXO bandwidth: 335 GB/s peak busBW (AllReduce, 8 GB message size, 2 nodes x 8 H100).

Impact

TCPXO cannot run with fully isolated pod networking (hostNetwork: false); all TCPXO workloads must set hostNetwork: true.

Questions

  1. Is hostNetwork: true now a required condition for TCPXO on GKE 1.35 + daemon v1.0.21, or is this a regression from earlier versions?
  2. Is the PCI sysfs filtering (7/8 GPUs visible without hostNetwork) expected behavior in the container runtime, or a bug?
  3. Could the daemon be made tolerant of PCI/CUDA GPU count mismatch when the missing GPU is still functional via CUDA?

We can provide full manifests, kubectl describe pod, and daemon logs as a tarball if needed.
