1. Network Overview
This lab implements an AI/GPU cluster network using a Cisco Nexus 9332PQ spine-leaf fabric with a rail-based L3 routing design optimized for RDMA traffic between GPU servers.
Key Design Decisions
- eBGP underlay (no OSPF, no EVPN/VXLAN) — each switch has its own AS number
- Rail design — each leaf = one rail, each NIC on a server maps to exactly one leaf/rail
- No LACP — the rails provide parallelism without relying on LAG hash distribution
- Per-NIC policy routing on servers — source-based routing tables for deterministic paths
- Cross-rail traffic routed through spines via eBGP (ECMP with 2 paths)
- Jumbo MTU — 9216 on fabric, 9000 on servers
- BFD enabled on all eBGP peerings for fast failure detection
2. Physical Topology
3. Device Inventory & IP Addressing
Network Switches
| Device | Role | Platform | NX-OS | Mgmt IP | Loopback0 | BGP AS |
|---|---|---|---|---|---|---|
| NX_AI_Spine1 | Spine | Nexus 9332PQ | 9.3(13) | 192.168.51.232 | 10.2.0.1/32 | 65000 |
| NX_AI_Spine2 | Spine | Nexus 9332PQ | 9.3(13) | 192.168.51.231 | 10.2.0.2/32 | 65000 |
| NX_AI_Leaf1 | Leaf Rail 0 | Nexus 9332PQ | 9.3(13) | 192.168.50.229 | 10.2.0.3/32 | 65001 |
| NX_AI_Leaf2 | Leaf Rail 1 | Nexus 9332PQ | 9.3(13) | 192.168.51.230 | 10.2.0.4/32 | 65002 |
GPU Servers
| Server | OS | NIC | Mgmt IP | ens6d1 (Rail 0) | ens6 (Rail 1) |
|---|---|---|---|---|---|
| gpuserver1 | Ubuntu | Mellanox ConnectX-3 Pro (Dual-port 40G) | 192.168.51.73 | 10.0.0.1/24 | 10.0.1.1/24 |
| gpuserver2 | Ubuntu | Mellanox ConnectX-3 Pro (Dual-port 40G) | 192.168.51.71 | 10.0.0.2/24 | 10.0.1.2/24 |
Other Devices
| Device | Role | IP |
|---|---|---|
| Lab_3750X | Management Switch | 192.168.51.142 |
| ESXi Host | VMware ESXi 7.0 | 192.168.50.32 |
4. Fabric Link Addressing
All fabric links are 40G QSFP+ point-to-point L3 links using /30 subnets.
| Link | Side A | Interface | IP | Side B | Interface | IP | Subnet |
|---|---|---|---|---|---|---|---|
| 1 | Spine1 | Eth1/14 | 10.4.0.6 | Leaf1 | Eth1/14 | 10.4.0.5 | 10.4.0.4/30 |
| 2 | Spine1 | Eth1/18 | 10.4.0.13 | Leaf2 | Eth1/18 | 10.4.0.14 | 10.4.0.12/30 |
| 3 | Spine2 | Eth1/13 | 10.4.0.2 | Leaf1 | Eth1/13 | 10.4.0.1 | 10.4.0.0/30 |
| 4 | Spine2 | Eth1/17 | 10.4.0.10 | Leaf2 | Eth1/17 | 10.4.0.9 | 10.4.0.8/30 |
Server-Facing Links
| Switch | Port | VLAN | Connected To | Server NIC | Speed |
|---|---|---|---|---|---|
| Leaf1 | Eth1/27 | 100 (access) | gpuserver1 | ens6d1 | 40G |
| Leaf1 | Eth1/28 | 100 (access) | gpuserver2 | ens6d1 | 40G |
| Leaf2 | Eth1/27 | 101 (access) | gpuserver1 | ens6 | 40G |
| Leaf2 | Eth1/28 | 101 (access) | gpuserver2 | ens6 | 40G |
| Leaf2 | Eth1/1/1 | — | ESXi Host | vmnic5 | 10G |
5. eBGP Underlay Configuration
Design: eBGP with physical interface peering (no loopback peering, no OSPF).
Each tier has its own AS number. Leaves have maximum-paths 2 for ECMP across both spines.
BFD is enabled on all BGP sessions for sub-second failover.
| Device | BGP AS | Router-ID | Networks Advertised |
|---|---|---|---|
| Spine1 | 65000 | 10.2.0.1 | 10.2.0.1/32 |
| Spine2 | 65000 | 10.2.0.2 | 10.2.0.2/32 |
| Leaf1 | 65001 | 10.2.0.3 | 10.2.0.3/32, 10.3.0.1/32, 10.0.0.0/24 |
| Leaf2 | 65002 | 10.2.0.4 | 10.2.0.4/32, 10.3.0.2/32, 10.0.1.0/24 |
5.1 NX_AI_Spine1 — BGP Configuration
```
! NX_AI_Spine1 (192.168.51.232) - AS 65000
router bgp 65000
  router-id 10.2.0.1
  address-family ipv4 unicast
    network 10.2.0.1/32
  ! Peer to Leaf1 via Eth1/14
  neighbor 10.4.0.5
    remote-as 65001
    description to-NX-AI-Leaf1
    bfd
    address-family ipv4 unicast
  ! Peer to Leaf2 via Eth1/18
  neighbor 10.4.0.14
    remote-as 65002
    description to-NX-AI-Leaf2
    bfd
    address-family ipv4 unicast
```
5.2 NX_AI_Spine2 — BGP Configuration
```
! NX_AI_Spine2 (192.168.51.231) - AS 65000
router bgp 65000
  router-id 10.2.0.2
  address-family ipv4 unicast
    network 10.2.0.2/32
  ! Peer to Leaf1 via Eth1/13
  neighbor 10.4.0.1
    remote-as 65001
    description to-NX-AI-Leaf1
    bfd
    address-family ipv4 unicast
  ! Peer to Leaf2 via Eth1/17
  neighbor 10.4.0.9
    remote-as 65002
    description to-NX-AI-Leaf2
    bfd
    address-family ipv4 unicast
```
5.3 NX_AI_Leaf1 — BGP Configuration
```
! NX_AI_Leaf1 (192.168.50.229) - AS 65001
router bgp 65001
  router-id 10.2.0.3
  address-family ipv4 unicast
    network 10.2.0.3/32
    network 10.3.0.1/32
    network 10.0.0.0/24    ! Rail 0 SVI subnet
    maximum-paths 2        ! ECMP across both spines
  ! Peer to Spine1 via Eth1/14
  neighbor 10.4.0.6
    remote-as 65000
    description to-NX-AI-Spine1
    bfd
    address-family ipv4 unicast
  ! Peer to Spine2 via Eth1/13
  neighbor 10.4.0.2
    remote-as 65000
    description to-NX-AI-Spine2
    bfd
    address-family ipv4 unicast
```
5.4 NX_AI_Leaf2 — BGP Configuration
```
! NX_AI_Leaf2 (192.168.51.230) - AS 65002
router bgp 65002
  router-id 10.2.0.4
  address-family ipv4 unicast
    network 10.2.0.4/32
    network 10.3.0.2/32
    network 10.0.1.0/24    ! Rail 1 SVI subnet
    maximum-paths 2        ! ECMP across both spines
  ! Peer to Spine1 via Eth1/18
  neighbor 10.4.0.13
    remote-as 65000
    description to-NX-AI-Spine1
    bfd
    address-family ipv4 unicast
  ! Peer to Spine2 via Eth1/17
  neighbor 10.4.0.10
    remote-as 65000
    description to-NX-AI-Spine2
    bfd
    address-family ipv4 unicast
```
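The expected session state can be spot-checked programmatically. A minimal sketch in the spirit of verify_rdma_routing.py — the parser is illustrative and assumes the standard `show ip bgp summary` column layout (a numeric State/PfxRcd value means the session is Established); the sample output below is synthetic:

```python
import re

def parse_bgp_summary(output: str) -> dict:
    """Map neighbor IP -> {remote_as, established} from 'show ip bgp summary'.

    A numeric final (State/PfxRcd) column indicates an Established session.
    """
    peers = {}
    for line in output.splitlines():
        m = re.match(r'^(\d+\.\d+\.\d+\.\d+)\s+\d+\s+(\d+).*\s(\S+)$', line)
        if m:
            peers[m.group(1)] = {
                'remote_as': int(m.group(2)),
                'established': m.group(3).isdigit(),
            }
    return peers

# Synthetic sample in the NX-OS summary column layout (values illustrative)
sample = """\
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.4.0.6        4 65000    1200    1195      45    0    0 02:11:03 5
10.4.0.2        4 65000    1198    1190      45    0    0 02:10:57 5
"""
peers = parse_bgp_summary(sample)
```

In practice the raw output would come from a Netmiko `send_command` call against each switch, as the other scripts in section 11 do.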
6. QoS / PFC / ECN Configuration (Lossless RDMA)
All 4 switches (Spine1, Spine2, Leaf1, Leaf2) run identical QoS configuration for lossless RoCE v2. RDMA traffic is classified by DSCP 26 and RoCE UDP ports (4741/4791), mapped to CoS 3 / qos-group 3, with Priority Flow Control (PFC) preventing packet drops and ECN signaling congestion before queues overflow.
Configuration audited and cleaned up February 2026.
Unused leftover class-maps (RDM, RDMA_2, RDMA_Class) and
policy-maps (ROCE_NET_POLICY, testcos) were removed from all switches.
6.1 Classification & Marking
ACL — RoCE UDP Port Matching
```
ip access-list rdma
  10 permit udp any any eq 4741
  20 permit udp any eq 4741 any
  30 permit udp any eq 4791 any
  40 permit udp any any eq 4791
! UDP 4791 is the IANA-assigned RoCE v2 port; 4741 is an extra port matched
! in this lab (note: RoCE v1 is L2-encapsulated and has no UDP port)
```
Class Maps — RDMA Traffic Identification
```
! Match DSCP 26 (AF31 — standard RoCE marking, maps to CoS 3)
class-map type qos match-all RDMA
  match dscp 26
! Match DSCP 26 OR RoCE UDP ports (broader catch-all)
class-map type qos match-any RDMA_UDP
  match dscp 26
  match access-group name rdma
```
Input Marking Policy
```
! Classify RDMA traffic into qos-group 3 for downstream processing
policy-map type qos QOS_MARKING
  class RDMA
    set qos-group 3
  class RDMA_UDP
    set qos-group 3
```
Classification Flow
Ingress packet → DSCP 26? or UDP 4741/4791? → qos-group 3 → CoS 3 queue → PFC protected + ECN marked
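The DSCP-to-CoS relationship in the flow above is plain bit arithmetic, and can be sanity-checked in two lines:

```python
# DSCP 26 (AF31) -> CoS 3: the mapping behind the classification flow above.
dscp = 26             # 0b011010 — the RoCE marking used in this fabric
tos_byte = dscp << 2  # DSCP occupies the top 6 bits of the IP ToS byte
cos = dscp >> 3       # 802.1p CoS is conventionally the top 3 DSCP bits
```

This is why DSCP 26 traffic lands in the CoS 3 queue that PFC protects.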
6.2 Network QoS (PFC + MTU)
```
! Network QoS: controls MTU per queue and PFC behavior
policy-map type network-qos QOS_NETWORK
  class type network-qos c-nq3
    mtu 9216         ! jumbo frames for RDMA queue
    pause pfc-cos 3  ! IEEE 802.1Qbb PFC on CoS 3
  class type network-qos c-nq-default
    mtu 9216         ! jumbo for all other traffic too
```
Per-Interface PFC Settings
```
! Applied to ALL server-facing and fabric-facing interfaces:
priority-flow-control mode on

! Global features enabled:
feature lldp   ! Link Layer Discovery Protocol
feature dcbx   ! Data Center Bridging Capability Exchange
```
How PFC Prevents RDMA Packet Loss
When a switch queue for CoS 3 fills to a threshold, PFC sends an IEEE 802.1Qbb PAUSE frame back to the upstream sender, telling it to stop transmitting on that priority class. This creates a lossless fabric — the upstream device buffers packets instead of the downstream device dropping them. Without PFC, RoCE performance degrades catastrophically because RDMA relies on the transport being lossless.
6.3 Egress Queuing (ECN + Priority)
```
! Egress queuing: scheduling + ECN for congestion signaling
policy-map type queuing RDMA_ECN_OUT
  class type queuing c-out-q3
    priority level 1                             ! strict priority (lowest latency)
    random-detect threshold burst-optimized ecn  ! DCQCN congestion signaling
  class type queuing c-out-q2
    bandwidth remaining percent 0
  class type queuing c-out-q1
    bandwidth remaining percent 0
  class type queuing c-out-q-default
    bandwidth remaining percent 100              ! all remaining BW for non-RDMA
```
System QoS Application
```
! Apply policies globally to the switching ASIC
system qos
  service-policy type network-qos QOS_NETWORK       ! PFC + MTU
  service-policy type queuing output RDMA_ECN_OUT   ! ECN + scheduling
```
ECN + DCQCN Explained
ECN (Explicit Congestion Notification) marks packets with a congestion bit instead of dropping them. When a ConnectX NIC receives an ECN-marked packet, it returns a Congestion Notification Packet (CNP) to the sender, triggering DCQCN (Data Center Quantized Congestion Notification): the sending NIC proactively reduces its rate, preventing queue buildup before PFC needs to pause. This gives us a two-layer defense:
- Layer 1 — ECN/DCQCN: Proactive rate reduction (soft congestion signal)
- Layer 2 — PFC: Last-resort pause frames (hard flow control, prevents drops)
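A toy model of the DCQCN reaction point (not the NIC's actual firmware algorithm, and with the rate-reduction factor α held at its maximum of 1.0 for illustration) shows how quickly multiplicative decrease throttles a flow after a few ECN marks:

```python
def dcqcn_cut(rate_gbps: float, alpha: float) -> float:
    """DCQCN multiplicative decrease on receiving a CNP: R' = R * (1 - alpha/2)."""
    return rate_gbps * (1 - alpha / 2)

rate, alpha = 40.0, 1.0   # line rate; alpha pinned at its maximum
for _ in range(3):        # three back-to-back CNPs
    rate = dcqcn_cut(rate, alpha)
# rate is now 5.0 Gb/s — three halvings of 40 Gb/s
```

In the real protocol α is itself adapted over time and the rate recovers in stages, but the shape of the response is the same: soft signals cut the rate long before queue occupancy forces a PFC pause.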
6.4 Design Summary
| Layer | Policy / Feature | Purpose | Key Setting |
|---|---|---|---|
| Input Classification | QOS_MARKING | Identify RDMA traffic | DSCP 26 + UDP 4741/4791 → qos-group 3 |
| Network QoS | QOS_NETWORK | Lossless transport | PFC pause on CoS 3, MTU 9216 |
| Egress Queuing | RDMA_ECN_OUT | Priority + congestion | Queue 3 = strict priority + ECN |
| Interface | PFC mode on | Per-port flow control | IEEE 802.1Qbb on all ports |
| Protocol | LLDP + DCBX | Capability exchange | Negotiate PFC parameters with NICs |
QoS at a Glance
Cleanup Scripts
| Script | Purpose | Targets |
|---|---|---|
| `check_leaf1_qos.py` | Audit QoS/DCB/PFC configuration on Leaf1 | Leaf1 |
| `cleanup_leaf1_qos.py` | Remove unused class-maps & policy-maps from Leaf1 | Leaf1 |
| `check_all_qos.py` | Audit QoS configuration on all 4 switches | All 4 switches |
| `cleanup_qos_all.py` | Clean Leaf2 junk + add ACL rdma to both spines | Leaf2, Spine1, Spine2 |
7. RDMA Rail Design
Concept: Each leaf switch acts as a dedicated "rail" for one port of each dual-port NIC. This ensures deterministic, low-latency paths for RDMA traffic. NCCL binds each GPU to a specific NIC, and Linux policy routing ensures traffic from that NIC always goes through the correct leaf.
Rail 0 — Leaf1
- VLAN 100
- Subnet: 10.0.0.0/24
- SVI Gateway: 10.0.0.254
- Server NIC: ens6d1
- Routing Table: 100 (rail0)
Rail 1 — Leaf2
- VLAN 101
- Subnet: 10.0.1.0/24
- SVI Gateway: 10.0.1.254
- Server NIC: ens6
- Routing Table: 101 (rail1)
7.1 Rail 0 — Leaf1 / VLAN 100
```
! Leaf1 - Rail 0 Switch Configuration
system jumbomtu 9216

interface Eth1/27
  switchport access vlan 100
  mtu 9216
  no shutdown

interface Eth1/28
  switchport access vlan 100
  mtu 9216
  no shutdown

interface Vlan100
  no shutdown
  mtu 9216
  ip address 10.0.0.254/24
```
7.2 Rail 1 — Leaf2 / VLAN 101
```
! Leaf2 - Rail 1 Switch Configuration
system jumbomtu 9216

interface Eth1/27
  switchport access vlan 101
  mtu 9216
  no shutdown

interface Eth1/28
  switchport access vlan 101
  mtu 9216
  no shutdown

interface Vlan101
  no shutdown
  mtu 9216
  ip address 10.0.1.254/24
```
8. GPU Server Configuration
8.1 NIC IP Addressing & MTU
Each server has a dual-port Mellanox ConnectX-3 Pro NIC. One port (ens6d1) connects to
Leaf1 (Rail 0) and the other (ens6) connects to Leaf2 (Rail 1). MTU is set to 9000 on both NICs.
```bash
# gpuserver1
# ens6d1 (Rail 0 - Leaf1)
ip addr add 10.0.0.1/24 dev ens6d1
ip link set ens6d1 mtu 9000
ip link set ens6d1 up

# ens6 (Rail 1 - Leaf2)
ip addr add 10.0.1.1/24 dev ens6
ip link set ens6 mtu 9000
ip link set ens6 up
```
```bash
# gpuserver2
# ens6d1 (Rail 0 - Leaf1)
ip addr add 10.0.0.2/24 dev ens6d1
ip link set ens6d1 mtu 9000
ip link set ens6d1 up

# ens6 (Rail 1 - Leaf2)
ip addr add 10.0.1.2/24 dev ens6
ip link set ens6 mtu 9000
ip link set ens6 up
```
8.2 Per-NIC Policy Routing
How it works: Each NIC has its own Linux routing table. An ip rule matches
the source IP of outgoing packets to select the correct table. This ensures traffic originating from
ens6d1 always routes through Leaf1, and traffic from ens6 always routes through Leaf2.
NCCL chain: NCCL binds GPU → NIC → NIC has source IP →
ip rule matches source → correct routing table → correct leaf gateway.
```bash
# Step 1: Register routing table names in /etc/iproute2/rt_tables
echo '100 rail0' >> /etc/iproute2/rt_tables
echo '101 rail1' >> /etc/iproute2/rt_tables

# Step 2: Rail 0 routing (ens6d1 → Leaf1 SVI 10.0.0.254)
ip route add 10.0.0.0/24 dev ens6d1 scope link table 100
ip route add default via 10.0.0.254 dev ens6d1 table 100
ip rule add from <ens6d1_ip> table 100

# Step 3: Rail 1 routing (ens6 → Leaf2 SVI 10.0.1.254)
ip route add 10.0.1.0/24 dev ens6 scope link table 101
ip route add default via 10.0.1.254 dev ens6 table 101
ip rule add from <ens6_ip> table 101
```
Policy Routing per Server
| Server | Rule: from | Table | Default Gateway | Via Device | Leaf |
|---|---|---|---|---|---|
| gpuserver1 | 10.0.0.1 | 100 (rail0) | 10.0.0.254 | ens6d1 | Leaf1 |
| gpuserver1 | 10.0.1.1 | 101 (rail1) | 10.0.1.254 | ens6 | Leaf2 |
| gpuserver2 | 10.0.0.2 | 100 (rail0) | 10.0.0.254 | ens6d1 | Leaf1 |
| gpuserver2 | 10.0.1.2 | 101 (rail1) | 10.0.1.254 | ens6 | Leaf2 |
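The rows above can be generated mechanically. A small hypothetical helper (rail parameters taken from the design in section 7) that renders the three policy-routing commands for any rail/NIC-IP pair:

```python
# Rail parameters from section 7; the helper itself is a sketch, not part
# of the lab's scripts.
RAILS = {
    'rail0': {'dev': 'ens6d1', 'table': 100,
              'subnet': '10.0.0.0/24', 'gw': '10.0.0.254'},
    'rail1': {'dev': 'ens6', 'table': 101,
              'subnet': '10.0.1.0/24', 'gw': '10.0.1.254'},
}

def rail_commands(rail, nic_ip):
    """Render the per-rail policy-routing commands for one NIC of one server."""
    r = RAILS[rail]
    return [
        f"ip route add {r['subnet']} dev {r['dev']} scope link table {r['table']}",
        f"ip route add default via {r['gw']} dev {r['dev']} table {r['table']}",
        f"ip rule add from {nic_ip} table {r['table']}",
    ]

cmds = rail_commands('rail0', '10.0.0.1')
```

Keeping the rail parameters in one dictionary is also how a script like configure_gpu_servers.py could avoid drift between the two servers.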
8.3 Netplan Persistence
Configuration is persisted via /etc/netplan/60-rdma-rails.yaml on both servers (chmod 600).
```yaml
# /etc/netplan/60-rdma-rails.yaml (gpuserver1 example)
network:
  version: 2
  ethernets:
    ens6d1:
      addresses:
        - 10.0.0.1/24
      mtu: 9000
      routing-policy:
        - from: 10.0.0.1
          table: 100
      routes:
        - to: 0.0.0.0/0
          via: 10.0.0.254
          table: 100
        - to: 10.0.0.0/24
          scope: link
          table: 100
    ens6:
      addresses:
        - 10.0.1.1/24
      mtu: 9000
      routing-policy:
        - from: 10.0.1.1
          table: 101
      routes:
        - to: 0.0.0.0/0
          via: 10.0.1.254
          table: 101
        - to: 10.0.1.0/24
          scope: link
          table: 101
```
9. MTU Configuration
| Segment | MTU | Where |
|---|---|---|
| Spine-Leaf fabric links | 9216 | Eth1/13, 1/14, 1/17, 1/18 on all switches |
| Leaf SVIs (Vlan100, Vlan101) | 9216 | Leaf1 Vlan100, Leaf2 Vlan101 |
| Leaf server-facing ports | 9216 | Eth1/27, Eth1/28 on both leaves |
| System jumbomtu (L2) | 9216 | Both leaves (system jumbomtu 9216) |
| Server NICs | 9000 | ens6d1, ens6 on both GPU servers |
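A quick back-of-the-envelope check (using approximate RoCE v2 header sizes, and ignoring IP options and GRH) confirms that the 4096-byte IB MTU used in section 13 fits comfortably inside the 9000-byte server MTU:

```python
# Approximate per-packet RoCE v2 overhead inside the Ethernet payload:
# IB BTH (12 B) + UDP (8 B) + IPv4 (20 B) + ICRC (4 B).
IB_MTU, ETH_MTU = 4096, 9000
OVERHEAD = 12 + 8 + 20 + 4
fits = IB_MTU + OVERHEAD <= ETH_MTU   # 4140 bytes vs 9000-byte MTU
```

The 9216-byte switch-side MTU leaves further headroom for VLAN tags and any additional encapsulation.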
10. Change Log & Issues Resolved
Ran gather_leaf_state.py to collect VLANs, port status, IPs, BGP/OSPF configs, and CDP neighbors from all 4 switches. Found both leaves had Eth1/27 (VLAN 100) and Eth1/28 (VLAN 101) as access ports, but no SVIs. BGP was L2VPN EVPN only with iBGP AS 65101 + OSPF underlay.
Ran configure_rdma_rails.py. Configured Leaf1 with VLAN 100 SVI (10.0.0.254/24) and Leaf2 with VLAN 101 SVI (10.0.1.254/24). Set jumbo MTU on all fabric and server-facing links. Added BGP IPv4 unicast network statements.
Leaf1 had Vlan1 IP 10.0.0.3/24 and Leaf2 had Vlan1 IP 10.0.0.4/24 which overlapped with the Rail 0 subnet (10.0.0.0/24). On Leaf2, the directly connected Vlan1 route (AD 0) beat the BGP route (AD 200) to 10.0.0.0/24.
Fix: Ran fix_vlan1_conflict.py — removed IP from Vlan1 and shut it down on both leaves.
When configure_rdma_rails.py initially ran, NX-OS silently rejected the Vlan100 IP (10.0.0.254/24) because Vlan1 already had 10.0.0.3/24 in the same subnet. After removing Vlan1's IP, the SVI was still empty.
Fix: Ran fix_leaf1_svi.py — re-applied ip address 10.0.0.254/24 to Vlan100.
Ran migrate_ospf_to_bgp.py. Removed old iBGP (AS 65101) and OSPF from all 4 switches. Created new eBGP configuration with AS 65000 (spines), 65001 (Leaf1), 65002 (Leaf2). Physical interface peering with BFD. ECMP via maximum-paths 2 on leaves.
Ran configure_gpu_servers.py. Configured both servers with per-NIC IPs, MTU 9000, policy routing tables (100/101), and persistent netplan configuration. All cross-server reachability tests passed (same-rail and cross-rail).
Updated AI_Cluster_Topology.drawio with eBGP AS numbers, correct per-leaf VLANs, SVI IPs, server routing table info, and revised fabric summary.
11. Automation Scripts
All scripts are located in C:\Claude\AI_LAB\scripts\ and use Netmiko for SSH automation.
| Script | Purpose | Targets |
|---|---|---|
| `gather_leaf_state.py` | Collect current VLANs, ports, IPs, BGP/OSPF configs from all switches | All 4 switches |
| `configure_rdma_rails.py` | Configure VLAN+SVI, access ports, MTU, BGP IPv4 unicast | All 4 switches |
| `fix_vlan1_conflict.py` | Remove conflicting Vlan1 IPs overlapping with RDMA subnets | Leaf1, Leaf2 |
| `fix_leaf1_svi.py` | Re-apply missing IP address to Leaf1 Vlan100 SVI | Leaf1 |
| `diagnose_svi.py` | Diagnostic: check SVI state, running-config, IP interface status | Leaf1, Leaf2 |
| `verify_rdma_routing.py` | Verify BGP tables, summaries, and routes on all switches | All 4 switches |
| `migrate_ospf_to_bgp.py` | Migrate from iBGP+OSPF to eBGP with physical interface peering | All 4 switches |
| `configure_gpu_servers.py` | Configure per-NIC IPs, MTU, policy routing, netplan on GPU servers | gpuserver1, gpuserver2 |
12. Verification Results
eBGP Sessions — All Established
| Device | Neighbor | Remote AS | State |
|---|---|---|---|
| Spine1 | 10.4.0.5 (Leaf1) | 65001 | Established |
| Spine1 | 10.4.0.14 (Leaf2) | 65002 | Established |
| Spine2 | 10.4.0.1 (Leaf1) | 65001 | Established |
| Spine2 | 10.4.0.9 (Leaf2) | 65002 | Established |
| Leaf1 | 10.4.0.6 (Spine1) | 65000 | Established |
| Leaf1 | 10.4.0.2 (Spine2) | 65000 | Established |
| Leaf2 | 10.4.0.13 (Spine1) | 65000 | Established |
| Leaf2 | 10.4.0.10 (Spine2) | 65000 | Established |
Cross-Rail Routing — ECMP Working
Leaves have 2 equal-cost paths to remote rail subnets via both spines:
```
! Leaf1: route to Rail 1 subnet (10.0.1.0/24) - 2 paths
10.0.1.0/24, ubest/mbest: 2/0
    *via 10.4.0.6, [20/0], BGP-65000    ! via Spine1
    *via 10.4.0.2, [20/0], BGP-65000    ! via Spine2

! Leaf2: route to Rail 0 subnet (10.0.0.0/24) - 2 paths
10.0.0.0/24, ubest/mbest: 2/0
    *via 10.4.0.13, [20/0], BGP-65000   ! via Spine1
    *via 10.4.0.10, [20/0], BGP-65000   ! via Spine2
```
Cross-Server Reachability — All Passed
| Test | From | To | Path | Result |
|---|---|---|---|---|
| Same-rail (Rail 0) | gpu1 10.0.0.1 | gpu2 10.0.0.2 | via Leaf1 only | PASS |
| Same-rail (Rail 1) | gpu1 10.0.1.1 | gpu2 10.0.1.2 | via Leaf2 only | PASS |
| Cross-rail | gpu1 Rail 0 (10.0.0.1) | gpu2 Rail 1 (10.0.1.2) | Leaf1 → Spine → Leaf2 | PASS |
| Cross-rail | gpu1 Rail 1 (10.0.1.1) | gpu2 Rail 0 (10.0.0.2) | Leaf2 → Spine → Leaf1 | PASS |
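The four tests above form a small matrix. One hypothetical way to drive it with iputils `ping` on the Ubuntu servers — `-s 8972` plus 8 bytes of ICMP and 20 bytes of IP header gives a full 9000-byte frame, and `-M do` sets don't-fragment so the probe also verifies jumbo MTU end to end:

```python
# Reachability matrix from the table above; the ping invocation is an
# assumption about how such a test could be scripted, not the lab's script.
TESTS = [
    # (source IP, destination IP, expected path)
    ('10.0.0.1', '10.0.0.2', 'same-rail via Leaf1'),
    ('10.0.1.1', '10.0.1.2', 'same-rail via Leaf2'),
    ('10.0.0.1', '10.0.1.2', 'cross-rail via spine'),
    ('10.0.1.1', '10.0.0.2', 'cross-rail via spine'),
]

def ping_cmd(src_ip, dst_ip, payload=8972):
    # 8972 B payload + 8 B ICMP + 20 B IP = 9000 B; -I pins the source IP
    # so the policy-routing rules pick the intended rail.
    return ['ping', '-c', '3', '-M', 'do', '-s', str(payload),
            '-I', src_ip, dst_ip]

cmd = ping_cmd('10.0.0.1', '10.0.1.2')
```

Pinning the source address with `-I` is what exercises the `ip rule from <ip>` lookups, so a pass here validates the policy routing as well as L3 reachability.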
13. RDMA Performance Results
Tested with ib_write_bw and ib_write_lat (perftest suite) using RDMA Write operations
over RoCE (RDMA over Converged Ethernet). All tests run between gpuserver1 and gpuserver2.
| Parameter | Value |
|---|---|
| RDMA Device | rocep130s0 (Mellanox ConnectX-3 Pro) |
| Link Speed | 40 GbE per port |
| IB MTU | 4096 bytes (-m 4096) |
| Ethernet MTU | 9000 (servers) / 9216 (switches) |
| Mode | RoCE (-F flag), all message sizes (-a) |
| Connection | RC (Reliable Connection) |
13.1 Bandwidth Tests (ib_write_bw)
Results — IB MTU 4096
| Test | Path | Peak BW | Avg @ 8MB | % Wire Rate |
|---|---|---|---|---|
| Same Rail 0 | gpu1 10.0.0.1 ↔ gpu2 10.0.0.2 via Leaf1 | 38.98 Gb/s | 30.70 Gb/s | ~97% |
| Same Rail 1 | gpu1 10.0.1.1 ↔ gpu2 10.0.1.2 via Leaf2 | 38.95 Gb/s | 30.48 Gb/s | ~97% |
| Cross-Rail | gpu1 Rail0 10.0.0.1 → gpu2 Rail1 10.0.1.2 via Spine | 38.94 Gb/s | 37.36 Gb/s | ~97% |
MTU Comparison — 2048 vs 4096
| Test | Peak (MTU 2048) | Peak (MTU 4096) | Improvement | Cross-Rail Avg@8MB (2048) | Cross-Rail Avg@8MB (4096) |
|---|---|---|---|---|---|
| Same Rail 0 | 38.01 Gb/s | 38.98 Gb/s | +2.5% | — | — |
| Same Rail 1 | 38.44 Gb/s | 38.95 Gb/s | +1.3% | — | — |
| Cross-Rail | 38.07 Gb/s | 38.94 Gb/s | +2.3% | 33.17 Gb/s | 37.36 Gb/s (+12.6%) |
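The cross-rail improvement figures can be re-derived from the raw numbers (percent gain relative to the MTU-2048 baseline, rounded to one decimal):

```python
def pct_gain(old_gbps, new_gbps):
    """Percent improvement of `new` over `old`, rounded to one decimal."""
    return round((new_gbps - old_gbps) / old_gbps * 100, 1)

cross_rail_peak = pct_gain(38.07, 38.94)  # peak bandwidth columns
cross_rail_avg = pct_gain(33.17, 37.36)   # Avg @ 8MB columns
```

The much larger gain in the 8 MB average (+12.6% vs +2.3% at peak) is where the bigger IB MTU pays off: fewer packets per message means less per-packet overhead on the sustained transfer.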
13.2 Latency Tests (ib_write_lat)
Results — IB MTU 4096
| Test | 2 bytes | 1 KB | 64 KB | 8 MB |
|---|---|---|---|---|
| Same Rail 0 | 2.92 μs | 4.56 μs | 18.70 μs | 2,137 μs |
| Same Rail 1 | 3.05 μs | 4.55 μs | 18.67 μs | 2,119 μs |
| Cross-Rail | 5.37 μs | 7.99 μs | 24.45 μs | 2,125 μs |
Latency Analysis
- Same-rail small message latency: ~3 μs — packet traverses Server NIC → Leaf switch → Server NIC (1 switch hop)
- Cross-rail adds ~2.4 μs — packet traverses Leaf → Spine → Leaf (3 switch hops instead of 1)
- At large sizes (8 MB), all paths converge to ~2.1 ms — serialization time dominates over switching latency
- Both rails symmetric — Leaf1 (Rail 0) and Leaf2 (Rail 1) perform identically
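The convergence at 8 MB follows from simple serialization arithmetic, assuming an effective throughput of roughly 31 Gb/s at that message size (the ~30.5–37 Gb/s range measured above):

```python
# Time to serialize an 8 MiB RDMA write at ~31 Gb/s effective rate.
msg_bits = 8 * 1024 * 1024 * 8   # 8 MiB message in bits
eff_rate_bps = 31e9              # assumed effective throughput, ~31 Gb/s
serialization_us = msg_bits / eff_rate_bps * 1e6
```

This lands around 2.2 ms, matching the ~2.1 ms measured on every path: at large message sizes the few microseconds of switch latency disappear into the serialization time.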
13.3 How Our Lab Compares
RDMA bypasses the kernel and TCP/IP stack entirely — data moves directly from NIC memory to NIC memory (zero-copy). This is why RDMA latency is orders of magnitude lower than regular TCP networking.
| Technology | Typical Latency | Notes |
|---|---|---|
| Our Same-Rail RDMA (40GbE RoCE) | ~3 μs | 1 switch hop (server → leaf → server) |
| Our Cross-Rail RDMA (via Spine) | ~5.4 μs | 3 switch hops (leaf → spine → leaf) |
| Typical TCP ping (same network) | ~100–300 μs | Kernel stack, context switches, TCP overhead |
| Regular Ethernet (no RDMA) | ~50–100 μs | Still goes through kernel networking stack |
| NVIDIA NVLink (GPU-to-GPU) | ~1–2 μs | Direct GPU interconnect within same server |
| PCIe (within same server) | ~0.5–1 μs | CPU-to-device within single machine |
Bandwidth Summary
Test Commands Reference
```bash
# Bandwidth test -- Server side (gpuserver2):
ib_write_bw -d rocep130s0 -i <ib_port> --source_ip <server_ip> --port=<tcp_port> -m 4096 -F --report_gbits -a

# Bandwidth test -- Client side (gpuserver1):
ib_write_bw -d rocep130s0 -i <ib_port> --source_ip <client_ip> --port=<tcp_port> -m 4096 -F --report_gbits -a <server_ip>

# Latency test -- same flags, replace ib_write_bw with ib_write_lat (no --report_gbits)
ib_write_lat -d rocep130s0 -i <ib_port> --source_ip <ip> --port=<tcp_port> -m 4096 -F -a [server_ip]

# IB port mapping:
#   -i 1 = ens6   (Rail 1, 10.0.1.x via Leaf2)
#   -i 2 = ens6d1 (Rail 0, 10.0.0.x via Leaf1)
# IMPORTANT: Use -i for the IB port, NOT -p (which sets the TCP port)
# Use -m 4096 for jumbo IB MTU (the default is only 2048)
# Use unique --port values (19001, 19002, ...) to avoid TCP conflicts
```