1. Network Overview

This lab implements an AI/GPU cluster network using a Cisco Nexus 9332PQ spine-leaf fabric with a rail-based L3 routing design optimized for RDMA traffic between GPU servers.

Key Design Decisions

  • eBGP underlay (no OSPF, no EVPN/VXLAN) — spines share AS 65000; each leaf has its own AS
  • Rail design — each leaf = one rail, each NIC on a server maps to exactly one leaf/rail
  • No LACP — rails provide parallel paths without hash-based link aggregation
  • Per-NIC policy routing on servers — source-based routing tables for deterministic paths
  • Cross-rail traffic routed through spines via eBGP (ECMP with 2 paths)
  • Jumbo MTU — 9216 on fabric, 9000 on servers
  • BFD enabled on all eBGP peerings for fast failure detection

2. Physical Topology

[Topology diagram — source: AI_Cluster_Topology.drawio]

GPU AI Cluster — Spine-Leaf Fabric | Cisco Nexus 9332PQ | eBGP L3 Rail Routing | 40G Fabric | February 2026

Management: Lab_3750X management switch, 192.168.51.142 (mgmt links to all devices)

Spine layer (both Nexus 9332PQ, NX-OS 9.3(13), BGP AS 65000):
  • NX_AI_Spine1 — Lo0 10.2.0.1, Mgmt 192.168.51.232
  • NX_AI_Spine2 — Lo0 10.2.0.2, Mgmt 192.168.51.231

Fabric links (40G QSFP+, BFD enabled):
  • Spine1 Eth1/14 ↔ Leaf1 Eth1/14
  • Spine1 Eth1/18 ↔ Leaf2 Eth1/18
  • Spine2 Eth1/13 ↔ Leaf1 Eth1/13
  • Spine2 Eth1/17 ↔ Leaf2 Eth1/17

Leaf layer (both Nexus 9332PQ, NX-OS 9.3(13)):
  • NX_AI_Leaf1 — AS 65001, Rail 0 (VLAN 100), Lo0 10.2.0.3, SVI Vlan100 10.0.0.254/24, Mgmt 192.168.50.229
  • NX_AI_Leaf2 — AS 65002, Rail 1 (VLAN 101), Lo0 10.2.0.4, SVI Vlan101 10.0.1.254/24, Mgmt 192.168.51.230

Server links:
  • Rail 0 (40G): Leaf1 Eth1/27 → gpuserver1 ens6d1, Leaf1 Eth1/28 → gpuserver2 ens6d1
  • Rail 1 (40G): Leaf2 Eth1/27 → gpuserver1 ens6, Leaf2 Eth1/28 → gpuserver2 ens6
  • Eth1/1/1 (10G) → ESXi host vmnic5 (VMware ESXi 7.0, 192.168.50.32)

Compute:
  • gpuserver1 — Ubuntu, Mellanox ConnectX-3 Pro, Mgmt 192.168.51.73; ens6d1 10.0.0.1 (tbl 100, gw .254), ens6 10.0.1.1 (tbl 101, gw .254)
  • gpuserver2 — Ubuntu, Mellanox ConnectX-3 Pro, Mgmt 192.168.51.71; ens6d1 10.0.0.2 (tbl 100, gw .254), ens6 10.0.1.2 (tbl 101, gw .254)

Fabric summary: eBGP (AS 65000 spines, 65001 Leaf1, 65002 Leaf2) | L3 rail routing, no EVPN/VXLAN | Rail 0 = VLAN 100, 10.0.0.0/24 (Leaf1 SVI .254) | Rail 1 = VLAN 101, 10.0.1.0/24 (Leaf2 SVI .254) | MTU 9000 servers / 9216 fabric | Per-NIC policy routing (tbl 100/101) | Platform: Cisco Nexus 9332PQ, NX-OS 9.3(13)

3. Device Inventory & IP Addressing

Network Switches

Device Role Platform NX-OS Mgmt IP Loopback0 BGP AS
NX_AI_Spine1 Spine Nexus 9332PQ 9.3(13) 192.168.51.232 10.2.0.1/32 65000
NX_AI_Spine2 Spine Nexus 9332PQ 9.3(13) 192.168.51.231 10.2.0.2/32 65000
NX_AI_Leaf1 Leaf Rail 0 Nexus 9332PQ 9.3(13) 192.168.50.229 10.2.0.3/32 65001
NX_AI_Leaf2 Leaf Rail 1 Nexus 9332PQ 9.3(13) 192.168.51.230 10.2.0.4/32 65002

GPU Servers

Server OS NIC Mgmt IP ens6d1 (Rail 0) ens6 (Rail 1)
gpuserver1 Ubuntu Mellanox ConnectX-3 Pro (Dual-port 40G) 192.168.51.73 10.0.0.1/24 10.0.1.1/24
gpuserver2 Ubuntu Mellanox ConnectX-3 Pro (Dual-port 40G) 192.168.51.71 10.0.0.2/24 10.0.1.2/24

Other Devices

Device      Role                 IP
Lab_3750X   Management Switch    192.168.51.142
ESXi Host   VMware ESXi 7.0      192.168.50.32

5. eBGP Underlay Configuration

Design: eBGP with physical interface peering (no loopback peering, no OSPF). Both spines share AS 65000; each leaf has its own AS (65001, 65002). Leaves use maximum-paths 2 for ECMP across both spines. BFD is enabled on all BGP sessions for sub-second failover.

Device   BGP AS   Router-ID   Networks Advertised
Spine1   65000    10.2.0.1    10.2.0.1/32
Spine2   65000    10.2.0.2    10.2.0.2/32
Leaf1    65001    10.2.0.3    10.2.0.3/32, 10.3.0.1/32, 10.0.0.0/24
Leaf2    65002    10.2.0.4    10.2.0.4/32, 10.3.0.2/32, 10.0.1.0/24
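The per-device configurations in 5.1–5.4 follow mechanically from this table. As an illustration, a small Python renderer in the spirit of the lab's Netmiko scripts (a hypothetical helper, not the actual migrate_ospf_to_bgp.py; data shapes are assumptions) could emit each stanza from the table rows:

```python
# Hypothetical renderer: table row -> NX-OS eBGP stanza.
def render_bgp(asn, router_id, networks, neighbors, ecmp=False):
    """neighbors: list of (peer_ip, remote_as, description) tuples."""
    lines = [f"router bgp {asn}",
             f"  router-id {router_id}",
             "  address-family ipv4 unicast"]
    lines += [f"    network {net}" for net in networks]
    if ecmp:
        lines.append("    maximum-paths 2")   # leaves only: ECMP over both spines
    for peer_ip, remote_as, desc in neighbors:
        lines += [f"  neighbor {peer_ip}",
                  f"    remote-as {remote_as}",
                  f"    description {desc}",
                  "    bfd",
                  "    address-family ipv4 unicast"]
    return "\n".join(lines)

# Leaf1's row from the table above:
leaf1 = render_bgp(65001, "10.2.0.3",
                   ["10.2.0.3/32", "10.3.0.1/32", "10.0.0.0/24"],
                   [("10.4.0.6", 65000, "to-NX-AI-Spine1"),
                    ("10.4.0.2", 65000, "to-NX-AI-Spine2")],
                   ecmp=True)
```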

5.1 NX_AI_Spine1 — BGP Configuration

! NX_AI_Spine1 (192.168.51.232) - AS 65000

router bgp 65000
  router-id 10.2.0.1
  address-family ipv4 unicast
    network 10.2.0.1/32

  ! Peer to Leaf1 via Eth1/14
  neighbor 10.4.0.5
    remote-as 65001
    description to-NX-AI-Leaf1
    bfd
    address-family ipv4 unicast

  ! Peer to Leaf2 via Eth1/18
  neighbor 10.4.0.14
    remote-as 65002
    description to-NX-AI-Leaf2
    bfd
    address-family ipv4 unicast

5.2 NX_AI_Spine2 — BGP Configuration

! NX_AI_Spine2 (192.168.51.231) - AS 65000

router bgp 65000
  router-id 10.2.0.2
  address-family ipv4 unicast
    network 10.2.0.2/32

  ! Peer to Leaf1 via Eth1/13
  neighbor 10.4.0.1
    remote-as 65001
    description to-NX-AI-Leaf1
    bfd
    address-family ipv4 unicast

  ! Peer to Leaf2 via Eth1/17
  neighbor 10.4.0.9
    remote-as 65002
    description to-NX-AI-Leaf2
    bfd
    address-family ipv4 unicast

5.3 NX_AI_Leaf1 — BGP Configuration

! NX_AI_Leaf1 (192.168.50.229) - AS 65001

router bgp 65001
  router-id 10.2.0.3
  address-family ipv4 unicast
    network 10.2.0.3/32
    network 10.3.0.1/32
    network 10.0.0.0/24           ! Rail 0 SVI subnet
    maximum-paths 2              ! ECMP across both spines

  ! Peer to Spine1 via Eth1/14
  neighbor 10.4.0.6
    remote-as 65000
    description to-NX-AI-Spine1
    bfd
    address-family ipv4 unicast

  ! Peer to Spine2 via Eth1/13
  neighbor 10.4.0.2
    remote-as 65000
    description to-NX-AI-Spine2
    bfd
    address-family ipv4 unicast

5.4 NX_AI_Leaf2 — BGP Configuration

! NX_AI_Leaf2 (192.168.51.230) - AS 65002

router bgp 65002
  router-id 10.2.0.4
  address-family ipv4 unicast
    network 10.2.0.4/32
    network 10.3.0.2/32
    network 10.0.1.0/24           ! Rail 1 SVI subnet
    maximum-paths 2              ! ECMP across both spines

  ! Peer to Spine1 via Eth1/18
  neighbor 10.4.0.13
    remote-as 65000
    description to-NX-AI-Spine1
    bfd
    address-family ipv4 unicast

  ! Peer to Spine2 via Eth1/17
  neighbor 10.4.0.10
    remote-as 65000
    description to-NX-AI-Spine2
    bfd
    address-family ipv4 unicast

6. QoS / PFC / ECN Configuration (Lossless RDMA)

All 4 switches (Spine1, Spine2, Leaf1, Leaf2) run identical QoS configuration for lossless RoCE v2. RDMA traffic is classified by DSCP 26 and RoCE UDP ports (4741/4791), mapped to CoS 3 / qos-group 3, with Priority Flow Control (PFC) preventing packet drops and ECN signaling congestion before queues overflow.

Configuration audited and cleaned up February 2026. Unused leftover class-maps (RDM, RDMA_2, RDMA_Class) and policy-maps (ROCE_NET_POLICY, testcos) were removed from all switches.

6.1 Classification & Marking

ACL — RoCE UDP Port Matching

ip access-list rdma
  10 permit udp any any eq 4741
  20 permit udp any eq 4741 any
  30 permit udp any eq 4791 any
  40 permit udp any any eq 4791
! UDP 4791 = RoCE v2 (IANA-assigned); UDP 4741 = legacy pre-standard RoCE v2 port

Class Maps — RDMA Traffic Identification

! Match DSCP 26 (AF31 — common RoCE marking; maps to CoS 3)
class-map type qos match-all RDMA
  match dscp 26

! Match DSCP 26 OR RoCE UDP ports (broader catch-all)
class-map type qos match-any RDMA_UDP
  match dscp 26
  match access-group name rdma

Input Marking Policy

! Classify RDMA traffic into qos-group 3 for downstream processing
policy-map type qos QOS_MARKING
  class RDMA
    set qos-group 3
  class RDMA_UDP
    set qos-group 3

Classification Flow

Ingress packet → DSCP 26 or UDP 4741/4791? → qos-group 3 → CoS 3 queue → PFC protected + ECN marked
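The same decision logic, as a toy Python classifier (not switch code; the DSCP value and port list are taken from the policy above):

```python
# Toy model of QOS_MARKING: DSCP 26 or RoCE UDP ports land in
# qos-group 3; everything else stays in the default group 0.
ROCE_PORTS = {4741, 4791}

def qos_group(dscp, sport=None, dport=None):
    if dscp == 26 or sport in ROCE_PORTS or dport in ROCE_PORTS:
        return 3
    return 0

assert qos_group(dscp=26) == 3              # RDMA class-map (match dscp 26)
assert qos_group(dscp=0, dport=4791) == 3   # ACL rdma match on RoCE v2 port
assert qos_group(dscp=0, dport=443) == 0    # ordinary traffic, default queue
```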

6.2 Network QoS (PFC + MTU)

! Network QoS: controls MTU per queue and PFC behavior
policy-map type network-qos QOS_NETWORK
  class type network-qos c-nq3
    mtu 9216              ← jumbo frames for RDMA queue
    pause pfc-cos 3       ← IEEE 802.1Qbb PFC on CoS 3
  class type network-qos c-nq-default
    mtu 9216              ← jumbo for all other traffic too

Per-Interface PFC Settings

! Applied to ALL server-facing and fabric-facing interfaces:
priority-flow-control mode on

! Global features enabled:
feature lldp     ← Link Layer Discovery Protocol
feature dcbx     ← Data Center Bridging Capability Exchange

How PFC Prevents RDMA Packet Loss

When a switch queue for CoS 3 fills to a threshold, PFC sends an IEEE 802.1Qbb PAUSE frame back to the upstream sender, telling it to stop transmitting on that priority class. This creates a lossless fabric — the upstream device buffers packets instead of the downstream device dropping them. Without PFC, RoCE performance degrades catastrophically because RDMA relies on the transport being lossless.
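A toy queue model makes the effect concrete (illustrative numbers, not Nexus buffer behavior):

```python
# Toy PFC model: once the CoS 3 queue crosses its pause (XOFF)
# threshold, the upstream sender is paused and buffers the burst;
# without PFC the downstream buffer overflows and drops.

def run(arrivals, capacity, drain, pfc, xoff):
    queue, upstream_held, drops = 0, 0, 0
    for burst in arrivals:
        if pfc and queue >= xoff:
            upstream_held += burst        # PAUSE frame: sender buffers instead
        else:
            queue += burst
            if queue > capacity:          # downstream buffer overflow
                drops += queue - capacity
                queue = capacity
        queue = max(0, queue - drain)     # egress drains each tick
    return drops, upstream_held

arrivals = [40] * 10                      # sustained incast burst
drops_no_pfc, _ = run(arrivals, capacity=100, drain=10, pfc=False, xoff=60)
drops_pfc, held = run(arrivals, capacity=100, drain=10, pfc=True, xoff=60)
# Without PFC the queue overflows and drops; with PFC nothing is lost.
```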

6.3 Egress Queuing (ECN + Priority)

! Egress queuing: scheduling + ECN for congestion signaling
policy-map type queuing RDMA_ECN_OUT
  class type queuing c-out-q3
    priority level 1                              ← strict priority (lowest latency)
    random-detect threshold burst-optimized ecn   ← DCQCN congestion signaling
  class type queuing c-out-q2
    bandwidth remaining percent 0
  class type queuing c-out-q1
    bandwidth remaining percent 0
  class type queuing c-out-q-default
    bandwidth remaining percent 100               ← all remaining BW for non-RDMA

System QoS Application

! Apply policies globally to the switching ASIC
system qos
  service-policy type network-qos QOS_NETWORK      ← PFC + MTU
  service-policy type queuing output RDMA_ECN_OUT  ← ECN + scheduling

ECN + DCQCN Explained

ECN (Explicit Congestion Notification) marks packets with a congestion bit instead of dropping them. When the receiving ConnectX NIC sees an ECN-marked packet, it returns a Congestion Notification Packet (CNP) to the sender, triggering DCQCN (Data Center Quantized Congestion Notification): the sending NIC reduces its rate proactively, preventing queue buildup before PFC needs to pause. This gives us a two-layer defense:

  • Layer 1 — ECN/DCQCN: Proactive rate reduction (soft congestion signal)
  • Layer 2 — PFC: Last-resort pause frames (hard flow control, prevents drops)
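A sketch of the sender-side reaction point (illustrative constants, not the ConnectX firmware defaults):

```python
# Toy DCQCN-style rate control: multiplicative decrease when a CNP
# arrives (ECN-marked traffic), gradual additive increase otherwise.
LINE_RATE = 40.0  # Gb/s, per the lab's 40GbE ports

def react(rate, cnp_received, alpha=0.5):
    if cnp_received:
        return rate * (1 - alpha / 2)       # cut rate before PFC must pause
    return min(LINE_RATE, rate + 0.5)       # recover toward line rate

rate = LINE_RATE
for cnp in [True, True, False, False]:      # two marks, then congestion clears
    rate = react(rate, cnp)
# 40.0 -> 30.0 -> 22.5 -> 23.0 -> 23.5 Gb/s: fast backoff, slow recovery
```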

6.4 Design Summary

Layer                  Policy / Feature   Purpose                 Key Setting
Input classification   QOS_MARKING        Identify RDMA traffic   DSCP 26 + UDP 4741/4791 → qos-group 3
Network QoS            QOS_NETWORK        Lossless transport      PFC pause on CoS 3, MTU 9216
Egress queuing         RDMA_ECN_OUT       Priority + congestion   Queue 3 = strict priority + ECN
Interface              PFC mode on        Per-port flow control   IEEE 802.1Qbb on all ports
Protocol               LLDP + DCBX        Capability exchange     Negotiate PFC parameters with NICs

QoS at a Glance

  • DSCP 26 + RoCE UDP 4741/4791 — RDMA classification
  • CoS 3 — PFC lossless queue (IEEE 802.1Qbb pause)
  • ECN — DCQCN signaling, proactive rate control
  • MTU 9216 — jumbo frames on all queues

Cleanup Scripts

Script                 Purpose                                              Targets
check_leaf1_qos.py Audit QoS/DCB/PFC configuration on Leaf1 Leaf1
cleanup_leaf1_qos.py Remove unused class-maps & policy-maps from Leaf1 Leaf1
check_all_qos.py Audit QoS configuration on all 4 switches All 4 switches
cleanup_qos_all.py Clean Leaf2 junk + add ACL rdma to both spines Leaf2, Spine1, Spine2

7. RDMA Rail Design

Concept: Each leaf switch acts as a dedicated "rail" for one port of each dual-port NIC. This ensures deterministic, low-latency paths for RDMA traffic. NCCL binds each GPU to a specific NIC, and Linux policy routing ensures traffic from that NIC always goes through the correct leaf.

Rail 0 — Leaf1

  • VLAN 100
  • Subnet: 10.0.0.0/24
  • SVI Gateway: 10.0.0.254
  • Server NIC: ens6d1
  • Routing Table: 100 (rail0)

Rail 1 — Leaf2

  • VLAN 101
  • Subnet: 10.0.1.0/24
  • SVI Gateway: 10.0.1.254
  • Server NIC: ens6
  • Routing Table: 101 (rail1)

7.1 Rail 0 — Leaf1 / VLAN 100

! Leaf1 - Rail 0 Switch Configuration

system jumbomtu 9216

interface Eth1/27
  switchport access vlan 100
  mtu 9216
  no shutdown

interface Eth1/28
  switchport access vlan 100
  mtu 9216
  no shutdown

interface Vlan100
  no shutdown
  mtu 9216
  ip address 10.0.0.254/24

7.2 Rail 1 — Leaf2 / VLAN 101

! Leaf2 - Rail 1 Switch Configuration

system jumbomtu 9216

interface Eth1/27
  switchport access vlan 101
  mtu 9216
  no shutdown

interface Eth1/28
  switchport access vlan 101
  mtu 9216
  no shutdown

interface Vlan101
  no shutdown
  mtu 9216
  ip address 10.0.1.254/24

8. GPU Server Configuration

8.1 NIC IP Addressing & MTU

Each server has a dual-port Mellanox ConnectX-3 Pro NIC. Port 1 (ens6d1) connects to Leaf1 (Rail 0) and Port 2 (ens6) connects to Leaf2 (Rail 1). MTU is set to 9000 on both NICs.

gpuserver1 (192.168.51.73):
# ens6d1 (Rail 0 - Leaf1)
ip addr add 10.0.0.1/24 dev ens6d1
ip link set ens6d1 mtu 9000
ip link set ens6d1 up

# ens6 (Rail 1 - Leaf2)
ip addr add 10.0.1.1/24 dev ens6
ip link set ens6 mtu 9000
ip link set ens6 up

gpuserver2 (192.168.51.71):
# ens6d1 (Rail 0 - Leaf1)
ip addr add 10.0.0.2/24 dev ens6d1
ip link set ens6d1 mtu 9000
ip link set ens6d1 up

# ens6 (Rail 1 - Leaf2)
ip addr add 10.0.1.2/24 dev ens6
ip link set ens6 mtu 9000
ip link set ens6 up

8.2 Per-NIC Policy Routing

How it works: Each NIC has its own Linux routing table. An ip rule matches the source IP of outgoing packets to select the correct table. This ensures traffic originating from ens6d1 always routes through Leaf1, and traffic from ens6 always routes through Leaf2.

NCCL chain: NCCL binds GPU → NIC → NIC has source IP → ip rule matches source → correct routing table → correct leaf gateway.

# Step 1: Register routing table names in /etc/iproute2/rt_tables
echo '100 rail0' >> /etc/iproute2/rt_tables
echo '101 rail1' >> /etc/iproute2/rt_tables

# Step 2: Rail 0 routing (ens6d1 → Leaf1 SVI 10.0.0.254)
ip route add 10.0.0.0/24 dev ens6d1 scope link table 100
ip route add default via 10.0.0.254 dev ens6d1 table 100
ip rule add from <ens6d1_ip> table 100

# Step 3: Rail 1 routing (ens6 → Leaf2 SVI 10.0.1.254)
ip route add 10.0.1.0/24 dev ens6 scope link table 101
ip route add default via 10.0.1.254 dev ens6 table 101
ip rule add from <ens6_ip> table 101

Policy Routing per Server

Server       Rule: from   Table         Default Gateway   Via Device   Leaf
gpuserver1   10.0.0.1     100 (rail0)   10.0.0.254        ens6d1       Leaf1
gpuserver1   10.0.1.1     101 (rail1)   10.0.1.254        ens6         Leaf2
gpuserver2   10.0.0.2     100 (rail0)   10.0.0.254        ens6d1       Leaf1
gpuserver2   10.0.1.2     101 (rail1)   10.0.1.254        ens6         Leaf2
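The lookup can be emulated in a few lines of Python (a toy model of the rules above, not kernel code; the servers use per-host `from` rules, modeled here as the rail /24s since every host in a rail subnet maps to the same table):

```python
# Source IP selects the rail table, which fixes the NIC and leaf gateway.
import ipaddress

RULES = [  # (source prefix, table) — mirrors `ip rule add from ... table ...`
    (ipaddress.ip_network("10.0.0.0/24"), 100),  # rail0 -> ens6d1 -> Leaf1
    (ipaddress.ip_network("10.0.1.0/24"), 101),  # rail1 -> ens6   -> Leaf2
]
TABLES = {100: ("ens6d1", "10.0.0.254"), 101: ("ens6", "10.0.1.254")}

def route(src_ip):
    """Return (table, egress NIC, gateway) for a packet's source IP."""
    src = ipaddress.ip_address(src_ip)
    for prefix, table in RULES:
        if src in prefix:
            return (table, *TABLES[table])
    raise LookupError(f"no rail rule for {src_ip}")  # would fall to table main
```

For example, `route("10.0.0.1")` picks table 100 with gateway 10.0.0.254 via ens6d1, exactly the gpuserver1 Rail 0 row above.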

8.3 Netplan Persistence

Configuration is persisted via /etc/netplan/60-rdma-rails.yaml on both servers (chmod 600).

# /etc/netplan/60-rdma-rails.yaml (gpuserver1 example)
network:
  version: 2
  ethernets:
    ens6d1:
      addresses:
        - 10.0.0.1/24
      mtu: 9000
      routing-policy:
        - from: 10.0.0.1
          table: 100
      routes:
        - to: 0.0.0.0/0
          via: 10.0.0.254
          table: 100
        - to: 10.0.0.0/24
          scope: link
          table: 100
    ens6:
      addresses:
        - 10.0.1.1/24
      mtu: 9000
      routing-policy:
        - from: 10.0.1.1
          table: 101
      routes:
        - to: 0.0.0.0/0
          via: 10.0.1.254
          table: 101
        - to: 10.0.1.0/24
          scope: link
          table: 101
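One sanity check worth scripting against this file (a hypothetical helper, shown with gpuserver1's values): each NIC's policy table must point at a default gateway inside that NIC's own subnet, or the rail silently falls back to table main:

```python
# Consistency check for the rail netplan config (gpuserver1's values).
import ipaddress

rails = {
    "ens6d1": {"addr": "10.0.0.1/24", "table": 100, "gw": "10.0.0.254"},
    "ens6":   {"addr": "10.0.1.1/24", "table": 101, "gw": "10.0.1.254"},
}

def check(rails):
    """Return a list of problems; empty means the rails are consistent."""
    problems = []
    for nic, r in rails.items():
        iface = ipaddress.ip_interface(r["addr"])
        if ipaddress.ip_address(r["gw"]) not in iface.network:
            problems.append(f"{nic}: gateway {r['gw']} outside {iface.network}")
    return problems

assert check(rails) == []   # gpuserver1's config is internally consistent
```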

9. MTU Configuration

Segment                        MTU    Where
Spine-leaf fabric links        9216   Eth1/13, 1/14, 1/17, 1/18 on all switches
Leaf SVIs (Vlan100, Vlan101)   9216   Leaf1 Vlan100, Leaf2 Vlan101
Leaf server-facing ports       9216   Eth1/27, Eth1/28 on both leaves
System jumbomtu (L2)           9216   Both leaves (system jumbomtu 9216)
Server NICs                    9000   ens6d1, ens6 on both GPU servers

10. Change Log & Issues Resolved

Step 1: Gathered Current State

Ran gather_leaf_state.py to collect VLANs, port status, IPs, BGP/OSPF configs, and CDP neighbors from all 4 switches. Found both leaves had Eth1/27 (VLAN 100) and Eth1/28 (VLAN 101) as access ports, but no SVIs. BGP was L2VPN EVPN only with iBGP AS 65101 + OSPF underlay.

Step 2: Configured RDMA Rail VLANs & SVIs

Ran configure_rdma_rails.py. Configured Leaf1 with VLAN 100 SVI (10.0.0.254/24) and Leaf2 with VLAN 101 SVI (10.0.1.254/24). Set jumbo MTU on all fabric and server-facing links. Added BGP IPv4 unicast network statements.

Issue: Vlan1 IP Conflict

Leaf1 had Vlan1 IP 10.0.0.3/24 and Leaf2 had Vlan1 IP 10.0.0.4/24 which overlapped with the Rail 0 subnet (10.0.0.0/24). On Leaf2, the directly connected Vlan1 route (AD 0) beat the BGP route (AD 200) to 10.0.0.0/24.

Fix: Ran fix_vlan1_conflict.py — removed IP from Vlan1 and shut it down on both leaves.

Issue: Leaf1 SVI Missing IP

When configure_rdma_rails.py initially ran, NX-OS silently rejected the Vlan100 IP (10.0.0.254/24) because Vlan1 already had 10.0.0.3/24 in the same subnet. After removing Vlan1's IP, the SVI was still empty.

Fix: Ran fix_leaf1_svi.py — re-applied ip address 10.0.0.254/24 to Vlan100.

Step 3: Migrated to eBGP Underlay

Ran migrate_ospf_to_bgp.py. Removed old iBGP (AS 65101) and OSPF from all 4 switches. Created new eBGP configuration with AS 65000 (spines), 65001 (Leaf1), 65002 (Leaf2). Physical interface peering with BFD. ECMP via maximum-paths 2 on leaves.

Step 4: Configured GPU Server Policy Routing

Ran configure_gpu_servers.py. Configured both servers with per-NIC IPs, MTU 9000, policy routing tables (100/101), and persistent netplan configuration. All cross-server reachability tests passed (same-rail and cross-rail).

Step 5: Updated Topology Diagram

Updated AI_Cluster_Topology.drawio with eBGP AS numbers, correct per-leaf VLANs, SVI IPs, server routing table info, and revised fabric summary.

11. Automation Scripts

All scripts are located in C:\Claude\AI_LAB\scripts\ and use Netmiko for SSH automation.

Script                    Purpose                                                           Targets
gather_leaf_state.py Collect current VLANs, ports, IPs, BGP/OSPF configs from all switches All 4 switches
configure_rdma_rails.py Configure VLAN+SVI, access ports, MTU, BGP IPv4 unicast All 4 switches
fix_vlan1_conflict.py Remove conflicting Vlan1 IPs overlapping with RDMA subnets Leaf1, Leaf2
fix_leaf1_svi.py Re-apply missing IP address to Leaf1 Vlan100 SVI Leaf1
diagnose_svi.py Diagnostic: check SVI state, running-config, IP interface status Leaf1, Leaf2
verify_rdma_routing.py Verify BGP tables, summaries, and routes on all switches All 4 switches
migrate_ospf_to_bgp.py Migrate from iBGP+OSPF to eBGP with physical interface peering All 4 switches
configure_gpu_servers.py Configure per-NIC IPs, MTU, policy routing, netplan on GPU servers gpuserver1, gpuserver2

12. Verification Results

eBGP Sessions — All Established

Device   Neighbor             Remote AS   State
Spine1   10.4.0.5 (Leaf1)     65001       Established
Spine1   10.4.0.14 (Leaf2)    65002       Established
Spine2   10.4.0.1 (Leaf1)     65001       Established
Spine2   10.4.0.9 (Leaf2)     65002       Established
Leaf1    10.4.0.6 (Spine1)    65000       Established
Leaf1    10.4.0.2 (Spine2)    65000       Established
Leaf2    10.4.0.13 (Spine1)   65000       Established
Leaf2    10.4.0.10 (Spine2)   65000       Established

Cross-Rail Routing — ECMP Working

Leaves have 2 equal-cost paths to remote rail subnets via both spines:

! Leaf1: route to Rail 1 subnet (10.0.1.0/24) - 2 paths
10.0.1.0/24, ubest/mbest: 2/0
    *via 10.4.0.6, [20/0], BGP-65000   ← via Spine1
    *via 10.4.0.2, [20/0], BGP-65000   ← via Spine2

! Leaf2: route to Rail 0 subnet (10.0.0.0/24) - 2 paths
10.0.0.0/24, ubest/mbest: 2/0
    *via 10.4.0.13, [20/0], BGP-65000  ← via Spine1
    *via 10.4.0.10, [20/0], BGP-65000  ← via Spine2
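A minimal check of the kind verify_rdma_routing.py could perform on this output (a sketch, with the sample trimmed to the essential lines):

```python
# Parse NX-OS "show ip route" output and confirm two ECMP next-hops.
import re

output = """10.0.1.0/24, ubest/mbest: 2/0
    *via 10.4.0.6, [20/0], BGP-65000
    *via 10.4.0.2, [20/0], BGP-65000"""

nexthops = re.findall(r"\*via (\d+\.\d+\.\d+\.\d+)", output)
ubest = int(re.search(r"ubest/mbest: (\d+)/", output).group(1))
# ECMP healthy when the advertised best-path count matches the via lines.
ecmp_ok = ubest == 2 and len(nexthops) == 2
```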

Cross-Server Reachability — All Passed

Test                 From                     To                       Path                    Result
Same-rail (Rail 0)   gpu1 10.0.0.1            gpu2 10.0.0.2            via Leaf1 only          PASS
Same-rail (Rail 1)   gpu1 10.0.1.1            gpu2 10.0.1.2            via Leaf2 only          PASS
Cross-rail           gpu1 Rail 0 (10.0.0.1)   gpu2 Rail 1 (10.0.1.2)   Leaf1 → Spine → Leaf2   PASS
Cross-rail           gpu1 Rail 1 (10.0.1.1)   gpu2 Rail 0 (10.0.0.2)   Leaf2 → Spine → Leaf1   PASS

13. RDMA Performance Results

Tested with ib_write_bw and ib_write_lat (perftest suite) using RDMA Write operations over RoCE (RDMA over Converged Ethernet). All tests run between gpuserver1 and gpuserver2.

Parameter      Value
RDMA device    rocep130s0 (Mellanox ConnectX-3 Pro)
Link speed     40 GbE per port
IB MTU         4096 bytes (-m 4096)
Ethernet MTU   9000 (servers) / 9216 (switches)
Mode           RoCE (-F flag), all message sizes (-a)
Connection     RC (Reliable Connection)

13.1 Bandwidth Tests (ib_write_bw)

Results — IB MTU 4096

Test Path Peak BW Avg @ 8MB % Wire Rate
Same Rail 0 gpu1 10.0.0.1 ↔ gpu2 10.0.0.2 via Leaf1 38.98 Gb/s 30.70 Gb/s ~97%
Same Rail 1 gpu1 10.0.1.1 ↔ gpu2 10.0.1.2 via Leaf2 38.95 Gb/s 30.48 Gb/s ~97%
Cross-Rail gpu1 Rail0 10.0.0.1 → gpu2 Rail1 10.0.1.2 via Spine 38.94 Gb/s 37.36 Gb/s ~97%

MTU Comparison — 2048 vs 4096

Test          Peak (MTU 2048)   Peak (MTU 4096)   Improvement   Cross-Rail Avg @ 8 MB (2048)   Cross-Rail Avg @ 8 MB (4096)
Same Rail 0   38.01 Gb/s        38.98 Gb/s        +2.5%         —                              —
Same Rail 1   38.44 Gb/s        38.95 Gb/s        +1.3%         —                              —
Cross-Rail    38.07 Gb/s        38.94 Gb/s        +2.3%         33.17 Gb/s                     37.36 Gb/s (+12.6%)

13.2 Latency Tests (ib_write_lat)

Results — IB MTU 4096

Test 2 bytes 1 KB 64 KB 8 MB
Same Rail 0 2.92 μs 4.56 μs 18.70 μs 2,137 μs
Same Rail 1 3.05 μs 4.55 μs 18.67 μs 2,119 μs
Cross-Rail 5.37 μs 7.99 μs 24.45 μs 2,125 μs

Latency Analysis

  • Same-rail small message latency: ~3 μs — packet traverses Server NIC → Leaf switch → Server NIC (1 switch hop)
  • Cross-rail adds ~2.4 μs — packet traverses Leaf → Spine → Leaf (3 switch hops instead of 1)
  • At large sizes (8 MB), all paths converge to ~2.1 ms — serialization time dominates over switching latency
  • Both rails symmetric — Leaf1 (Rail 0) and Leaf2 (Rail 1) perform identically

13.3 How Our Lab Compares

RDMA bypasses the kernel and TCP/IP stack entirely — data moves directly from NIC memory to NIC memory (zero-copy). This is why RDMA latency is orders of magnitude lower than regular TCP networking.

Technology                          Typical Latency   Notes
Our Same-Rail RDMA (40GbE RoCE) ~3 μs 1 switch hop (server → leaf → server)
Our Cross-Rail RDMA (via Spine) ~5.4 μs 3 switch hops (leaf → spine → leaf)
Typical TCP ping (same network) ~100–300 μs Kernel stack, context switches, TCP overhead
Regular Ethernet (no RDMA) ~50–100 μs Still goes through kernel networking stack
NVIDIA NVLink (GPU-to-GPU) ~1–2 μs Direct GPU interconnect within same server
PCIe (within same server) ~0.5–1 μs CPU-to-device within single machine

Bandwidth Summary

  • ~39 Gb/s peak bandwidth — 97% of 40GbE wire rate
  • ~3 μs same-rail latency — small messages (2 bytes)
  • +2.4 μs cross-rail overhead — extra spine-hop penalty

Test Commands Reference

# Bandwidth test -- Server side (gpuserver2):
ib_write_bw -d rocep130s0 -i <ib_port> --source_ip <server_ip> --port=<tcp_port> -m 4096 -F --report_gbits -a

# Bandwidth test -- Client side (gpuserver1):
ib_write_bw -d rocep130s0 -i <ib_port> --source_ip <client_ip> --port=<tcp_port> -m 4096 -F --report_gbits -a <server_ip>

# Latency test -- same flags, replace ib_write_bw with ib_write_lat (no --report_gbits)
ib_write_lat -d rocep130s0 -i <ib_port> --source_ip <ip> --port=<tcp_port> -m 4096 -F -a [server_ip]

# IB Port mapping:
#   -i 1 = ens6   (Rail 1, 10.0.1.x via Leaf2)
#   -i 2 = ens6d1 (Rail 0, 10.0.0.x via Leaf1)
# IMPORTANT: Use -i for IB port, NOT -p (which sets TCP port)
# Use -m 4096 for jumbo IB MTU (default is only 2048)
# Use unique --port values (19001, 19002...) to avoid TCP conflicts