1. Network Overview
This lab implements an AI/GPU cluster network using a Cisco Nexus 9332PQ spine-leaf fabric with a rail-based L3 routing design optimized for RDMA traffic between GPU servers.
Key Design Decisions
- eBGP underlay (no OSPF, no EVPN/VXLAN) — each switch has its own AS number
- Rail design — each leaf = one rail, each NIC on a server maps to exactly one leaf/rail
- No LACP — the rails provide parallelism without relying on LAG hash distribution
- Per-NIC policy routing on servers — source-based routing tables for deterministic paths
- Cross-rail traffic routed through spines via eBGP (ECMP with 2 paths)
- Jumbo MTU — 9216 on fabric, 9000 on servers
- BFD enabled on all eBGP peerings for fast failure detection
2. Physical Topology
3. Device Inventory & IP Addressing
Network Switches
| Device | Role | Platform | NX-OS | Mgmt IP | Loopback0 | BGP AS |
|---|---|---|---|---|---|---|
| NX_AI_Spine1 | Spine | Nexus 9332PQ | 9.3(13) | 192.168.51.232 | 10.2.0.1/32 | 65000 |
| NX_AI_Spine2 | Spine | Nexus 9332PQ | 9.3(13) | 192.168.51.231 | 10.2.0.2/32 | 65000 |
| NX_AI_Leaf1 | Leaf Rail 0 | Nexus 9332PQ | 9.3(13) | 192.168.50.229 | 10.2.0.3/32 | 65001 |
| NX_AI_Leaf2 | Leaf Rail 1 | Nexus 9332PQ | 9.3(13) | 192.168.51.230 | 10.2.0.4/32 | 65002 |
GPU Servers
| Server | OS | NIC | Mgmt IP | ens6d1 (Rail 0) | ens6 (Rail 1) |
|---|---|---|---|---|---|
| gpuserver1 | Ubuntu | Mellanox ConnectX-3 Pro (Dual-port 40G) | 192.168.51.73 | 10.0.0.1/24 | 10.0.1.1/24 |
| gpuserver2 | Ubuntu | Mellanox ConnectX-3 Pro (Dual-port 40G) | 192.168.51.71 | 10.0.0.2/24 | 10.0.1.2/24 |
Other Devices
| Device | Role | IP |
|---|---|---|
| Lab_3750X | Management Switch | 192.168.51.142 |
| ESXi Host | VMware ESXi 7.0 | 192.168.50.32 |
4. Fabric Link Addressing
All fabric links are 40G QSFP+ point-to-point L3 links using /30 subnets.
| Link | Side A | Interface | IP | Side B | Interface | IP | Subnet |
|---|---|---|---|---|---|---|---|
| 1 | Spine1 | Eth1/14 | 10.4.0.6 | Leaf1 | Eth1/14 | 10.4.0.5 | 10.4.0.4/30 |
| 2 | Spine1 | Eth1/18 | 10.4.0.13 | Leaf2 | Eth1/18 | 10.4.0.14 | 10.4.0.12/30 |
| 3 | Spine2 | Eth1/13 | 10.4.0.2 | Leaf1 | Eth1/13 | 10.4.0.1 | 10.4.0.0/30 |
| 4 | Spine2 | Eth1/17 | 10.4.0.10 | Leaf2 | Eth1/17 | 10.4.0.9 | 10.4.0.8/30 |
Server-Facing Links
| Switch | Port | VLAN | Connected To | Server NIC | Speed |
|---|---|---|---|---|---|
| Leaf1 | Eth1/27 | 100 (access) | gpuserver1 | ens6d1 | 40G |
| Leaf1 | Eth1/28 | 100 (access) | gpuserver2 | ens6d1 | 40G |
| Leaf2 | Eth1/27 | 101 (access) | gpuserver1 | ens6 | 40G |
| Leaf2 | Eth1/28 | 101 (access) | gpuserver2 | ens6 | 40G |
| Leaf2 | Eth1/1/1 | — | ESXi Host | vmnic5 | 10G |
5. eBGP Underlay Configuration
Design: eBGP with physical interface peering (no loopback peering, no OSPF).
Each tier has its own AS number. Leaves have maximum-paths 2 for ECMP across both spines.
BFD is enabled on all BGP sessions for sub-second failover.
| Device | BGP AS | Router-ID | Networks Advertised |
|---|---|---|---|
| Spine1 | 65000 | 10.2.0.1 | 10.2.0.1/32 |
| Spine2 | 65000 | 10.2.0.2 | 10.2.0.2/32 |
| Leaf1 | 65001 | 10.2.0.3 | 10.2.0.3/32, 10.3.0.1/32, 10.0.0.0/24 |
| Leaf2 | 65002 | 10.2.0.4 | 10.2.0.4/32, 10.3.0.2/32, 10.0.1.0/24 |
5.1 NX_AI_Spine1 — BGP Configuration
```
! NX_AI_Spine1 (192.168.51.232) - AS 65000
router bgp 65000
  router-id 10.2.0.1
  address-family ipv4 unicast
    network 10.2.0.1/32
  ! Peer to Leaf1 via Eth1/14
  neighbor 10.4.0.5
    remote-as 65001
    description to-NX-AI-Leaf1
    bfd
    address-family ipv4 unicast
  ! Peer to Leaf2 via Eth1/18
  neighbor 10.4.0.14
    remote-as 65002
    description to-NX-AI-Leaf2
    bfd
    address-family ipv4 unicast
```
5.2 NX_AI_Spine2 — BGP Configuration
```
! NX_AI_Spine2 (192.168.51.231) - AS 65000
router bgp 65000
  router-id 10.2.0.2
  address-family ipv4 unicast
    network 10.2.0.2/32
  ! Peer to Leaf1 via Eth1/13
  neighbor 10.4.0.1
    remote-as 65001
    description to-NX-AI-Leaf1
    bfd
    address-family ipv4 unicast
  ! Peer to Leaf2 via Eth1/17
  neighbor 10.4.0.9
    remote-as 65002
    description to-NX-AI-Leaf2
    bfd
    address-family ipv4 unicast
```
5.3 NX_AI_Leaf1 — BGP Configuration
```
! NX_AI_Leaf1 (192.168.50.229) - AS 65001
router bgp 65001
  router-id 10.2.0.3
  address-family ipv4 unicast
    network 10.2.0.3/32
    network 10.3.0.1/32
    network 10.0.0.0/24    ! Rail 0 SVI subnet
    maximum-paths 2        ! ECMP across both spines
  ! Peer to Spine1 via Eth1/14
  neighbor 10.4.0.6
    remote-as 65000
    description to-NX-AI-Spine1
    bfd
    address-family ipv4 unicast
  ! Peer to Spine2 via Eth1/13
  neighbor 10.4.0.2
    remote-as 65000
    description to-NX-AI-Spine2
    bfd
    address-family ipv4 unicast
```
5.4 NX_AI_Leaf2 — BGP Configuration
```
! NX_AI_Leaf2 (192.168.51.230) - AS 65002
router bgp 65002
  router-id 10.2.0.4
  address-family ipv4 unicast
    network 10.2.0.4/32
    network 10.3.0.2/32
    network 10.0.1.0/24    ! Rail 1 SVI subnet
    maximum-paths 2        ! ECMP across both spines
  ! Peer to Spine1 via Eth1/18
  neighbor 10.4.0.13
    remote-as 65000
    description to-NX-AI-Spine1
    bfd
    address-family ipv4 unicast
  ! Peer to Spine2 via Eth1/17
  neighbor 10.4.0.10
    remote-as 65000
    description to-NX-AI-Spine2
    bfd
    address-family ipv4 unicast
```
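The expected session state can be spot-checked programmatically. A minimal sketch in the spirit of verify_rdma_routing.py — the parser is illustrative and assumes the standard `show ip bgp summary` column layout (a numeric State/PfxRcd value means the session is Established); the sample output below is synthetic:

```python
import re

def parse_bgp_summary(output: str) -> dict:
    """Map neighbor IP -> {remote_as, established} from 'show ip bgp summary'.

    A numeric final (State/PfxRcd) column indicates an Established session.
    """
    peers = {}
    for line in output.splitlines():
        m = re.match(r'^(\d+\.\d+\.\d+\.\d+)\s+\d+\s+(\d+).*\s(\S+)$', line)
        if m:
            peers[m.group(1)] = {
                'remote_as': int(m.group(2)),
                'established': m.group(3).isdigit(),
            }
    return peers

# Synthetic sample in the NX-OS summary column layout (values illustrative)
sample = """\
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.4.0.6        4 65000    1200    1195      45    0    0 02:11:03 5
10.4.0.2        4 65000    1198    1190      45    0    0 02:10:57 5
"""
peers = parse_bgp_summary(sample)
```

In practice the raw output would come from a Netmiko `send_command` call against each switch, as the other scripts in section 11 do.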
6. QoS / PFC / ECN Configuration (Lossless RDMA)
All 4 switches (Spine1, Spine2, Leaf1, Leaf2) run identical QoS configuration for lossless RoCE v2. RDMA traffic is classified by DSCP 26 and RoCE UDP ports (4741/4791), mapped to CoS 3 / qos-group 3, with Priority Flow Control (PFC) preventing packet drops and ECN signaling congestion before queues overflow.
Configuration audited and cleaned up February 2026.
Unused leftover class-maps (RDM, RDMA_2, RDMA_Class) and
policy-maps (ROCE_NET_POLICY, testcos) were removed from all switches.
6.1 Classification & Marking
ACL — RoCE UDP Port Matching
```
ip access-list rdma
  10 permit udp any any eq 4741
  20 permit udp any eq 4741 any
  30 permit udp any eq 4791 any
  40 permit udp any any eq 4791
! UDP 4791 is the IANA-assigned RoCE v2 port; 4741 is an extra port matched
! in this lab (note: RoCE v1 is L2-encapsulated and has no UDP port)
```
Class Maps — RDMA Traffic Identification
```
! Match DSCP 26 (AF31 — standard RoCE marking, maps to CoS 3)
class-map type qos match-all RDMA
  match dscp 26
! Match DSCP 26 OR RoCE UDP ports (broader catch-all)
class-map type qos match-any RDMA_UDP
  match dscp 26
  match access-group name rdma
```
Input Marking Policy
```
! Classify RDMA traffic into qos-group 3 for downstream processing
policy-map type qos QOS_MARKING
  class RDMA
    set qos-group 3
  class RDMA_UDP
    set qos-group 3
```
Classification Flow
Ingress packet → DSCP 26? or UDP 4741/4791? → qos-group 3 → CoS 3 queue → PFC protected + ECN marked
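The DSCP-to-CoS relationship in the flow above is plain bit arithmetic, and can be sanity-checked in two lines:

```python
# DSCP 26 (AF31) -> CoS 3: the mapping behind the classification flow above.
dscp = 26             # 0b011010 — the RoCE marking used in this fabric
tos_byte = dscp << 2  # DSCP occupies the top 6 bits of the IP ToS byte
cos = dscp >> 3       # 802.1p CoS is conventionally the top 3 DSCP bits
```

This is why DSCP 26 traffic lands in the CoS 3 queue that PFC protects.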
6.2 Network QoS (PFC + MTU)
```
! Network QoS: controls MTU per queue and PFC behavior
policy-map type network-qos QOS_NETWORK
  class type network-qos c-nq3
    mtu 9216         ! jumbo frames for RDMA queue
    pause pfc-cos 3  ! IEEE 802.1Qbb PFC on CoS 3
  class type network-qos c-nq-default
    mtu 9216         ! jumbo for all other traffic too
```
Per-Interface PFC Settings
```
! Applied to ALL server-facing and fabric-facing interfaces:
priority-flow-control mode on

! Global features enabled:
feature lldp   ! Link Layer Discovery Protocol
feature dcbx   ! Data Center Bridging Capability Exchange
```
How PFC Prevents RDMA Packet Loss
When a switch queue for CoS 3 fills to a threshold, PFC sends an IEEE 802.1Qbb PAUSE frame back to the upstream sender, telling it to stop transmitting on that priority class. This creates a lossless fabric — the upstream device buffers packets instead of the downstream device dropping them. Without PFC, RoCE performance degrades catastrophically because RDMA relies on the transport being lossless.
6.3 Egress Queuing (ECN + Priority)
```
! Egress queuing: scheduling + ECN for congestion signaling
policy-map type queuing RDMA_ECN_OUT
  class type queuing c-out-q3
    priority level 1                             ! strict priority (lowest latency)
    random-detect threshold burst-optimized ecn  ! DCQCN congestion signaling
  class type queuing c-out-q2
    bandwidth remaining percent 0
  class type queuing c-out-q1
    bandwidth remaining percent 0
  class type queuing c-out-q-default
    bandwidth remaining percent 100              ! all remaining BW for non-RDMA
```
System QoS Application
```
! Apply policies globally to the switching ASIC
system qos
  service-policy type network-qos QOS_NETWORK       ! PFC + MTU
  service-policy type queuing output RDMA_ECN_OUT   ! ECN + scheduling
```
ECN + DCQCN Explained
ECN (Explicit Congestion Notification) marks packets with a congestion bit instead of dropping them. When a ConnectX NIC receives an ECN-marked packet, it returns a Congestion Notification Packet (CNP) to the sender, triggering DCQCN (Data Center Quantized Congestion Notification): the sending NIC proactively reduces its rate, preventing queue buildup before PFC needs to pause. This gives us a two-layer defense:
- Layer 1 — ECN/DCQCN: Proactive rate reduction (soft congestion signal)
- Layer 2 — PFC: Last-resort pause frames (hard flow control, prevents drops)
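A toy model of the DCQCN reaction point (not the NIC's actual firmware algorithm, and with the rate-reduction factor α held at its maximum of 1.0 for illustration) shows how quickly multiplicative decrease throttles a flow after a few ECN marks:

```python
def dcqcn_cut(rate_gbps: float, alpha: float) -> float:
    """DCQCN multiplicative decrease on receiving a CNP: R' = R * (1 - alpha/2)."""
    return rate_gbps * (1 - alpha / 2)

rate, alpha = 40.0, 1.0   # line rate; alpha pinned at its maximum
for _ in range(3):        # three back-to-back CNPs
    rate = dcqcn_cut(rate, alpha)
# rate is now 5.0 Gb/s — three halvings of 40 Gb/s
```

In the real protocol α is itself adapted over time and the rate recovers in stages, but the shape of the response is the same: soft signals cut the rate long before queue occupancy forces a PFC pause.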
6.4 Design Summary
| Layer | Policy / Feature | Purpose | Key Setting |
|---|---|---|---|
| Input Classification | QOS_MARKING | Identify RDMA traffic | DSCP 26 + UDP 4741/4791 → qos-group 3 |
| Network QoS | QOS_NETWORK | Lossless transport | PFC pause on CoS 3, MTU 9216 |
| Egress Queuing | RDMA_ECN_OUT | Priority + congestion | Queue 3 = strict priority + ECN |
| Interface | PFC mode on | Per-port flow control | IEEE 802.1Qbb on all ports |
| Protocol | LLDP + DCBX | Capability exchange | Negotiate PFC parameters with NICs |
QoS at a Glance
Cleanup Scripts
| Script | Purpose | Targets |
|---|---|---|
| `check_leaf1_qos.py` | Audit QoS/DCB/PFC configuration on Leaf1 | Leaf1 |
| `cleanup_leaf1_qos.py` | Remove unused class-maps & policy-maps from Leaf1 | Leaf1 |
| `check_all_qos.py` | Audit QoS configuration on all 4 switches | All 4 switches |
| `cleanup_qos_all.py` | Clean Leaf2 junk + add ACL rdma to both spines | Leaf2, Spine1, Spine2 |
7. RDMA Rail Design
Concept: Each leaf switch acts as a dedicated "rail" for one port of each dual-port NIC. This ensures deterministic, low-latency paths for RDMA traffic. NCCL binds each GPU to a specific NIC, and Linux policy routing ensures traffic from that NIC always goes through the correct leaf.
Rail 0 — Leaf1
- VLAN 100
- Subnet: 10.0.0.0/24
- SVI Gateway: 10.0.0.254
- Server NIC: ens6d1
- Routing Table: 100 (rail0)
Rail 1 — Leaf2
- VLAN 101
- Subnet: 10.0.1.0/24
- SVI Gateway: 10.0.1.254
- Server NIC: ens6
- Routing Table: 101 (rail1)
7.1 Rail 0 — Leaf1 / VLAN 100
```
! Leaf1 - Rail 0 Switch Configuration
system jumbomtu 9216

interface Eth1/27
  switchport access vlan 100
  mtu 9216
  no shutdown

interface Eth1/28
  switchport access vlan 100
  mtu 9216
  no shutdown

interface Vlan100
  no shutdown
  mtu 9216
  ip address 10.0.0.254/24
```
7.2 Rail 1 — Leaf2 / VLAN 101
```
! Leaf2 - Rail 1 Switch Configuration
system jumbomtu 9216

interface Eth1/27
  switchport access vlan 101
  mtu 9216
  no shutdown

interface Eth1/28
  switchport access vlan 101
  mtu 9216
  no shutdown

interface Vlan101
  no shutdown
  mtu 9216
  ip address 10.0.1.254/24
```
8. GPU Server Configuration
8.1 NIC IP Addressing & MTU
Each server has a dual-port Mellanox ConnectX-3 Pro NIC. One port (ens6d1) connects to
Leaf1 (Rail 0) and the other (ens6) connects to Leaf2 (Rail 1). MTU is set to 9000 on both NICs.
```bash
# gpuserver1
# ens6d1 (Rail 0 - Leaf1)
ip addr add 10.0.0.1/24 dev ens6d1
ip link set ens6d1 mtu 9000
ip link set ens6d1 up

# ens6 (Rail 1 - Leaf2)
ip addr add 10.0.1.1/24 dev ens6
ip link set ens6 mtu 9000
ip link set ens6 up
```
```bash
# gpuserver2
# ens6d1 (Rail 0 - Leaf1)
ip addr add 10.0.0.2/24 dev ens6d1
ip link set ens6d1 mtu 9000
ip link set ens6d1 up

# ens6 (Rail 1 - Leaf2)
ip addr add 10.0.1.2/24 dev ens6
ip link set ens6 mtu 9000
ip link set ens6 up
```
8.2 Per-NIC Policy Routing
How it works: Each NIC has its own Linux routing table. An ip rule matches
the source IP of outgoing packets to select the correct table. This ensures traffic originating from
ens6d1 always routes through Leaf1, and traffic from ens6 always routes through Leaf2.
NCCL chain: NCCL binds GPU → NIC → NIC has source IP →
ip rule matches source → correct routing table → correct leaf gateway.
```bash
# Step 1: Register routing table names in /etc/iproute2/rt_tables
echo '100 rail0' >> /etc/iproute2/rt_tables
echo '101 rail1' >> /etc/iproute2/rt_tables

# Step 2: Rail 0 routing (ens6d1 → Leaf1 SVI 10.0.0.254)
ip route add 10.0.0.0/24 dev ens6d1 scope link table 100
ip route add default via 10.0.0.254 dev ens6d1 table 100
ip rule add from <ens6d1_ip> table 100

# Step 3: Rail 1 routing (ens6 → Leaf2 SVI 10.0.1.254)
ip route add 10.0.1.0/24 dev ens6 scope link table 101
ip route add default via 10.0.1.254 dev ens6 table 101
ip rule add from <ens6_ip> table 101
```
Policy Routing per Server
| Server | Rule: from | Table | Default Gateway | Via Device | Leaf |
|---|---|---|---|---|---|
| gpuserver1 | 10.0.0.1 | 100 (rail0) | 10.0.0.254 | ens6d1 | Leaf1 |
| gpuserver1 | 10.0.1.1 | 101 (rail1) | 10.0.1.254 | ens6 | Leaf2 |
| gpuserver2 | 10.0.0.2 | 100 (rail0) | 10.0.0.254 | ens6d1 | Leaf1 |
| gpuserver2 | 10.0.1.2 | 101 (rail1) | 10.0.1.254 | ens6 | Leaf2 |
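The rows above can be generated mechanically. A small hypothetical helper (rail parameters taken from the design in section 7) that renders the three policy-routing commands for any rail/NIC-IP pair:

```python
# Rail parameters from section 7; the helper itself is a sketch, not part
# of the lab's scripts.
RAILS = {
    'rail0': {'dev': 'ens6d1', 'table': 100,
              'subnet': '10.0.0.0/24', 'gw': '10.0.0.254'},
    'rail1': {'dev': 'ens6', 'table': 101,
              'subnet': '10.0.1.0/24', 'gw': '10.0.1.254'},
}

def rail_commands(rail, nic_ip):
    """Render the per-rail policy-routing commands for one NIC of one server."""
    r = RAILS[rail]
    return [
        f"ip route add {r['subnet']} dev {r['dev']} scope link table {r['table']}",
        f"ip route add default via {r['gw']} dev {r['dev']} table {r['table']}",
        f"ip rule add from {nic_ip} table {r['table']}",
    ]

cmds = rail_commands('rail0', '10.0.0.1')
```

Keeping the rail parameters in one dictionary is also how a script like configure_gpu_servers.py could avoid drift between the two servers.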
8.3 Netplan Persistence
Configuration is persisted via /etc/netplan/60-rdma-rails.yaml on both servers (chmod 600).
```yaml
# /etc/netplan/60-rdma-rails.yaml (gpuserver1 example)
network:
  version: 2
  ethernets:
    ens6d1:
      addresses:
        - 10.0.0.1/24
      mtu: 9000
      routing-policy:
        - from: 10.0.0.1
          table: 100
      routes:
        - to: 0.0.0.0/0
          via: 10.0.0.254
          table: 100
        - to: 10.0.0.0/24
          scope: link
          table: 100
    ens6:
      addresses:
        - 10.0.1.1/24
      mtu: 9000
      routing-policy:
        - from: 10.0.1.1
          table: 101
      routes:
        - to: 0.0.0.0/0
          via: 10.0.1.254
          table: 101
        - to: 10.0.1.0/24
          scope: link
          table: 101
```
9. MTU Configuration
| Segment | MTU | Where |
|---|---|---|
| Spine-Leaf fabric links | 9216 | Eth1/13, 1/14, 1/17, 1/18 on all switches |
| Leaf SVIs (Vlan100, Vlan101) | 9216 | Leaf1 Vlan100, Leaf2 Vlan101 |
| Leaf server-facing ports | 9216 | Eth1/27, Eth1/28 on both leaves |
| System jumbomtu (L2) | 9216 | Both leaves (system jumbomtu 9216) |
| Server NICs | 9000 | ens6d1, ens6 on both GPU servers |
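A quick back-of-the-envelope check (using approximate RoCE v2 header sizes, and ignoring IP options and GRH) confirms that the 4096-byte IB MTU used in section 13 fits comfortably inside the 9000-byte server MTU:

```python
# Approximate per-packet RoCE v2 overhead inside the Ethernet payload:
# IB BTH (12 B) + UDP (8 B) + IPv4 (20 B) + ICRC (4 B).
IB_MTU, ETH_MTU = 4096, 9000
OVERHEAD = 12 + 8 + 20 + 4
fits = IB_MTU + OVERHEAD <= ETH_MTU   # 4140 bytes vs 9000-byte MTU
```

The 9216-byte switch-side MTU leaves further headroom for VLAN tags and any additional encapsulation.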
10. Change Log & Issues Resolved
Ran gather_leaf_state.py to collect VLANs, port status, IPs, BGP/OSPF configs, and CDP neighbors from all 4 switches. Found both leaves had Eth1/27 (VLAN 100) and Eth1/28 (VLAN 101) as access ports, but no SVIs. BGP was L2VPN EVPN only with iBGP AS 65101 + OSPF underlay.
Ran configure_rdma_rails.py. Configured Leaf1 with VLAN 100 SVI (10.0.0.254/24) and Leaf2 with VLAN 101 SVI (10.0.1.254/24). Set jumbo MTU on all fabric and server-facing links. Added BGP IPv4 unicast network statements.
Leaf1 had Vlan1 IP 10.0.0.3/24 and Leaf2 had Vlan1 IP 10.0.0.4/24 which overlapped with the Rail 0 subnet (10.0.0.0/24). On Leaf2, the directly connected Vlan1 route (AD 0) beat the BGP route (AD 200) to 10.0.0.0/24.
Fix: Ran fix_vlan1_conflict.py — removed IP from Vlan1 and shut it down on both leaves.
When configure_rdma_rails.py initially ran, NX-OS silently rejected the Vlan100 IP (10.0.0.254/24) because Vlan1 already had 10.0.0.3/24 in the same subnet. After removing Vlan1's IP, the SVI was still empty.
Fix: Ran fix_leaf1_svi.py — re-applied ip address 10.0.0.254/24 to Vlan100.
Ran migrate_ospf_to_bgp.py. Removed old iBGP (AS 65101) and OSPF from all 4 switches. Created new eBGP configuration with AS 65000 (spines), 65001 (Leaf1), 65002 (Leaf2). Physical interface peering with BFD. ECMP via maximum-paths 2 on leaves.
Ran configure_gpu_servers.py. Configured both servers with per-NIC IPs, MTU 9000, policy routing tables (100/101), and persistent netplan configuration. All cross-server reachability tests passed (same-rail and cross-rail).
Updated AI_Cluster_Topology.drawio with eBGP AS numbers, correct per-leaf VLANs, SVI IPs, server routing table info, and revised fabric summary.
11. Automation Scripts
All scripts are located in C:\Claude\AI_LAB\scripts\ and use Netmiko for SSH automation.
| Script | Purpose | Targets |
|---|---|---|
| `gather_leaf_state.py` | Collect current VLANs, ports, IPs, BGP/OSPF configs from all switches | All 4 switches |
| `configure_rdma_rails.py` | Configure VLAN+SVI, access ports, MTU, BGP IPv4 unicast | All 4 switches |
| `fix_vlan1_conflict.py` | Remove conflicting Vlan1 IPs overlapping with RDMA subnets | Leaf1, Leaf2 |
| `fix_leaf1_svi.py` | Re-apply missing IP address to Leaf1 Vlan100 SVI | Leaf1 |
| `diagnose_svi.py` | Diagnostic: check SVI state, running-config, IP interface status | Leaf1, Leaf2 |
| `verify_rdma_routing.py` | Verify BGP tables, summaries, and routes on all switches | All 4 switches |
| `migrate_ospf_to_bgp.py` | Migrate from iBGP+OSPF to eBGP with physical interface peering | All 4 switches |
| `configure_gpu_servers.py` | Configure per-NIC IPs, MTU, policy routing, netplan on GPU servers | gpuserver1, gpuserver2 |
12. Verification Results
eBGP Sessions — All Established
| Device | Neighbor | Remote AS | State |
|---|---|---|---|
| Spine1 | 10.4.0.5 (Leaf1) | 65001 | Established |
| Spine1 | 10.4.0.14 (Leaf2) | 65002 | Established |
| Spine2 | 10.4.0.1 (Leaf1) | 65001 | Established |
| Spine2 | 10.4.0.9 (Leaf2) | 65002 | Established |
| Leaf1 | 10.4.0.6 (Spine1) | 65000 | Established |
| Leaf1 | 10.4.0.2 (Spine2) | 65000 | Established |
| Leaf2 | 10.4.0.13 (Spine1) | 65000 | Established |
| Leaf2 | 10.4.0.10 (Spine2) | 65000 | Established |
Cross-Rail Routing — ECMP Working
Leaves have 2 equal-cost paths to remote rail subnets via both spines:
```
! Leaf1: route to Rail 1 subnet (10.0.1.0/24) - 2 paths
10.0.1.0/24, ubest/mbest: 2/0
    *via 10.4.0.6, [20/0], BGP-65000    ! via Spine1
    *via 10.4.0.2, [20/0], BGP-65000    ! via Spine2

! Leaf2: route to Rail 0 subnet (10.0.0.0/24) - 2 paths
10.0.0.0/24, ubest/mbest: 2/0
    *via 10.4.0.13, [20/0], BGP-65000   ! via Spine1
    *via 10.4.0.10, [20/0], BGP-65000   ! via Spine2
```
Cross-Server Reachability — All Passed
| Test | From | To | Path | Result |
|---|---|---|---|---|
| Same-rail (Rail 0) | gpu1 10.0.0.1 | gpu2 10.0.0.2 | via Leaf1 only | PASS |
| Same-rail (Rail 1) | gpu1 10.0.1.1 | gpu2 10.0.1.2 | via Leaf2 only | PASS |
| Cross-rail | gpu1 Rail 0 (10.0.0.1) | gpu2 Rail 1 (10.0.1.2) | Leaf1 → Spine → Leaf2 | PASS |
| Cross-rail | gpu1 Rail 1 (10.0.1.1) | gpu2 Rail 0 (10.0.0.2) | Leaf2 → Spine → Leaf1 | PASS |
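The four tests above form a small matrix. One hypothetical way to drive it with iputils `ping` on the Ubuntu servers — `-s 8972` plus 8 bytes of ICMP and 20 bytes of IP header gives a full 9000-byte frame, and `-M do` sets don't-fragment so the probe also verifies jumbo MTU end to end:

```python
# Reachability matrix from the table above; the ping invocation is an
# assumption about how such a test could be scripted, not the lab's script.
TESTS = [
    # (source IP, destination IP, expected path)
    ('10.0.0.1', '10.0.0.2', 'same-rail via Leaf1'),
    ('10.0.1.1', '10.0.1.2', 'same-rail via Leaf2'),
    ('10.0.0.1', '10.0.1.2', 'cross-rail via spine'),
    ('10.0.1.1', '10.0.0.2', 'cross-rail via spine'),
]

def ping_cmd(src_ip, dst_ip, payload=8972):
    # 8972 B payload + 8 B ICMP + 20 B IP = 9000 B; -I pins the source IP
    # so the policy-routing rules pick the intended rail.
    return ['ping', '-c', '3', '-M', 'do', '-s', str(payload),
            '-I', src_ip, dst_ip]

cmd = ping_cmd('10.0.0.1', '10.0.1.2')
```

Pinning the source address with `-I` is what exercises the `ip rule from <ip>` lookups, so a pass here validates the policy routing as well as L3 reachability.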
13. RDMA Performance Results
Tested with ib_write_bw and ib_write_lat (perftest suite) using RDMA Write operations
over RoCE (RDMA over Converged Ethernet). All tests run between gpuserver1 and gpuserver2.
| Parameter | Value |
|---|---|
| RDMA Device | rocep130s0 (Mellanox ConnectX-3 Pro) |
| Link Speed | 40 GbE per port |
| IB MTU | 4096 bytes (-m 4096) |
| Ethernet MTU | 9000 (servers) / 9216 (switches) |
| Mode | RoCE (-F flag), all message sizes (-a) |
| Connection | RC (Reliable Connection) |
13.1 Bandwidth Tests (ib_write_bw)
Results — IB MTU 4096
| Test | Path | Peak BW | Avg @ 8MB | % Wire Rate |
|---|---|---|---|---|
| Same Rail 0 | gpu1 10.0.0.1 ↔ gpu2 10.0.0.2 via Leaf1 | 38.98 Gb/s | 30.70 Gb/s | ~97% |
| Same Rail 1 | gpu1 10.0.1.1 ↔ gpu2 10.0.1.2 via Leaf2 | 38.95 Gb/s | 30.48 Gb/s | ~97% |
| Cross-Rail | gpu1 Rail0 10.0.0.1 → gpu2 Rail1 10.0.1.2 via Spine | 38.94 Gb/s | 37.36 Gb/s | ~97% |
MTU Comparison — 2048 vs 4096
| Test | Peak (MTU 2048) | Peak (MTU 4096) | Improvement | Cross-Rail Avg@8MB (2048) | Cross-Rail Avg@8MB (4096) |
|---|---|---|---|---|---|
| Same Rail 0 | 38.01 Gb/s | 38.98 Gb/s | +2.5% | — | — |
| Same Rail 1 | 38.44 Gb/s | 38.95 Gb/s | +1.3% | — | — |
| Cross-Rail | 38.07 Gb/s | 38.94 Gb/s | +2.3% | 33.17 Gb/s | 37.36 Gb/s (+12.6%) |
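The cross-rail improvement figures can be re-derived from the raw numbers (percent gain relative to the MTU-2048 baseline, rounded to one decimal):

```python
def pct_gain(old_gbps, new_gbps):
    """Percent improvement of `new` over `old`, rounded to one decimal."""
    return round((new_gbps - old_gbps) / old_gbps * 100, 1)

cross_rail_peak = pct_gain(38.07, 38.94)  # peak bandwidth columns
cross_rail_avg = pct_gain(33.17, 37.36)   # Avg @ 8MB columns
```

The much larger gain in the 8 MB average (+12.6% vs +2.3% at peak) is where the bigger IB MTU pays off: fewer packets per message means less per-packet overhead on the sustained transfer.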
13.2 Latency Tests (ib_write_lat)
Results — IB MTU 4096
| Test | 2 bytes | 1 KB | 64 KB | 8 MB |
|---|---|---|---|---|
| Same Rail 0 | 2.92 μs | 4.56 μs | 18.70 μs | 2,137 μs |
| Same Rail 1 | 3.05 μs | 4.55 μs | 18.67 μs | 2,119 μs |
| Cross-Rail | 5.37 μs | 7.99 μs | 24.45 μs | 2,125 μs |
Latency Analysis
- Same-rail small message latency: ~3 μs — packet traverses Server NIC → Leaf switch → Server NIC (1 switch hop)
- Cross-rail adds ~2.4 μs — packet traverses Leaf → Spine → Leaf (3 switch hops instead of 1)
- At large sizes (8 MB), all paths converge to ~2.1 ms — serialization time dominates over switching latency
- Both rails symmetric — Leaf1 (Rail 0) and Leaf2 (Rail 1) perform identically
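The convergence at 8 MB follows from simple serialization arithmetic, assuming an effective throughput of roughly 31 Gb/s at that message size (the ~30.5–37 Gb/s range measured above):

```python
# Time to serialize an 8 MiB RDMA write at ~31 Gb/s effective rate.
msg_bits = 8 * 1024 * 1024 * 8   # 8 MiB message in bits
eff_rate_bps = 31e9              # assumed effective throughput, ~31 Gb/s
serialization_us = msg_bits / eff_rate_bps * 1e6
```

This lands around 2.2 ms, matching the ~2.1 ms measured on every path: at large message sizes the few microseconds of switch latency disappear into the serialization time.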
13.3 How Our Lab Compares
RDMA bypasses the kernel and TCP/IP stack entirely — data moves directly from NIC memory to NIC memory (zero-copy). This is why RDMA latency is orders of magnitude lower than regular TCP networking.
| Technology | Typical Latency | Notes |
|---|---|---|
| Our Same-Rail RDMA (40GbE RoCE) | ~3 μs | 1 switch hop (server → leaf → server) |
| Our Cross-Rail RDMA (via Spine) | ~5.4 μs | 3 switch hops (leaf → spine → leaf) |
| Typical TCP ping (same network) | ~100–300 μs | Kernel stack, context switches, TCP overhead |
| Regular Ethernet (no RDMA) | ~50–100 μs | Still goes through kernel networking stack |
| NVIDIA NVLink (GPU-to-GPU) | ~1–2 μs | Direct GPU interconnect within same server |
| PCIe (within same server) | ~0.5–1 μs | CPU-to-device within single machine |
Bandwidth Summary
Test Commands Reference
```bash
# Bandwidth test -- Server side (gpuserver2):
ib_write_bw -d rocep130s0 -i <ib_port> --source_ip <server_ip> --port=<tcp_port> -m 4096 -F --report_gbits -a

# Bandwidth test -- Client side (gpuserver1):
ib_write_bw -d rocep130s0 -i <ib_port> --source_ip <client_ip> --port=<tcp_port> -m 4096 -F --report_gbits -a <server_ip>

# Latency test -- same flags, replace ib_write_bw with ib_write_lat (no --report_gbits)
ib_write_lat -d rocep130s0 -i <ib_port> --source_ip <ip> --port=<tcp_port> -m 4096 -F -a [server_ip]

# IB port mapping:
#   -i 1 = ens6   (Rail 1, 10.0.1.x via Leaf2)
#   -i 2 = ens6d1 (Rail 0, 10.0.0.x via Leaf1)
# IMPORTANT: Use -i for the IB port, NOT -p (which sets the TCP port)
# Use -m 4096 for jumbo IB MTU (the default is only 2048)
# Use unique --port values (19001, 19002, ...) to avoid TCP conflicts
```