BGP neighbor stuck in Active state (TCP_connection_open_failed) #12706
Closed
Description
Describe the bug
- Did you check if this is a duplicate issue?
- Did you test it on the latest FRRouting/frr master branch?
To Reproduce
- Zebra is running with -M dplane_fpm_nl --asic-offload=notify_on_offload. (Does not reproduce on the same version when using the old FPM plugin, -M fpm.)
- Set ip nht resolve-via-default.
- Configure a neighbor that should be resolved via the default route:
neighbor 11.0.0.1 remote-as 65100
neighbor 11.0.0.1 peer-group BGPMON
neighbor 11.0.0.1 description mon
The full FRR configuration is attached: config.txt
The session with 11.0.0.1 is stuck in Active state:
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
11.0.0.1 4 65100 0 0 0 0 0 never Active 0 mon
The neighbor details report the last reset reason as "Waiting for NHT":
r-tigon-21# show ip bgp neighbors 11.0.0.1
BGP neighbor is 11.0.0.1, remote AS 65100, local AS 65100, internal link
Description: mon
Member of peer-group BGPMON for session parameters
BGP version 4, remote router ID 0.0.0.0, local router ID 10.1.0.32
BGP state = Active
Last read 00:05:27, Last write never
Hold time is 180, keepalive interval is 60 seconds
Graceful restart information:
Local GR Mode: Helper*
Remote GR Mode: NotApplicable
R bit: False
Timers:
Configured Restart Time(sec): 120
Received Restart Time(sec): 0
Message statistics:
Inq depth is 0
Outq depth is 0
Sent Rcvd
Opens: 0 0
Notifications: 0 0
Updates: 0 0
Keepalives: 0 0
Route Refresh: 0 0
Capability: 0 0
Total: 0 0
Minimum time between advertisement runs is 0 seconds
Update source is 10.1.0.32
For address family: IPv4 Unicast
BGPMON peer-group member
Not part of any update group
Community attribute sent to this neighbor(all)
Inbound path policy configured
Outbound path policy configured
Route map for incoming advertisements is *FROM_BGPMON
Route map for outgoing advertisements is *TO_BGPMON
0 accepted prefixes
Maximum prefixes allowed 1
Threshold for warning message 75%
For address family: IPv6 Unicast
BGPMON peer-group member
Not part of any update group
Community attribute sent to this neighbor(all)
0 accepted prefixes
Connections established 0; dropped 0
Last reset 00:05:27, Waiting for NHT
BGP Connect Retry Timer in Seconds: 120
Next connect timer due in 33 seconds
Read thread: off Write thread: off FD used: -1
Default route is learned and installed in FIB:
r-tigon-21# show ip route 0.0.0.0/0
Routing entry for 0.0.0.0/0
Known via "bgp", distance 20, metric 0, best
Last update 00:09:52 ago
* 10.0.0.1, via PortChannel102, weight 1
* 10.0.0.5, via PortChannel105, weight 1
* 10.0.0.9, via PortChannel108, weight 1
* 10.0.0.13, via PortChannel1011, weight 1
show ip nht says 11.0.0.1 is resolved:
11.0.0.1
resolved via bgp
via 10.0.0.1, PortChannel102
via 10.0.0.5, PortChannel105
via 10.0.0.9, PortChannel108
via 10.0.0.13, PortChannel1011
Client list: bgp(fd 39)
show bgp nexthop:
11.0.0.1 invalid, #paths 0, peer 11.0.0.1
Last update: Mon Jan 30 14:31:59 2023
Not sure why the status is invalid here, given that zebra reports the next hop as resolved.
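To make the disagreement between zebra and bgpd easier to spot across many peers, here is a minimal Python sketch that compares the two outputs. It assumes only the text layouts shown in the excerpts above (`show ip nht` and `show bgp nexthop`); the function names are hypothetical and not part of FRR, and other FRR versions may format these commands differently.

```python
import re

# Sample text copied from the outputs quoted in this issue.
NHT_OUTPUT = """\
11.0.0.1
 resolved via bgp
 via 10.0.0.1, PortChannel102
 via 10.0.0.5, PortChannel105
 Client list: bgp(fd 39)
"""

BNC_OUTPUT = """\
 11.0.0.1 invalid, #paths 0, peer 11.0.0.1
  Last update: Mon Jan 30 14:31:59 2023
"""

IPV4 = r"\d+\.\d+\.\d+\.\d+"


def parse_zebra_nht(text):
    """Map tracked address -> True if zebra prints 'resolved via' for it."""
    resolved = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if re.fullmatch(IPV4, stripped):
            # A bare address starts a new NHT entry.
            current = stripped
            resolved[current] = False
        elif current and stripped.startswith("resolved via"):
            resolved[current] = True
    return resolved


def parse_bgp_nexthop(text):
    """Map next hop address -> 'valid' or 'invalid' as printed by bgpd."""
    status = {}
    for line in text.splitlines():
        m = re.match(rf"\s*({IPV4})\s+(valid|invalid)\b", line)
        if m:
            status[m.group(1)] = m.group(2)
    return status


def find_mismatches(nht_text, bnc_text):
    """Addresses zebra reports as resolved but bgpd still marks invalid."""
    zebra = parse_zebra_nht(nht_text)
    bgpd = parse_bgp_nexthop(bnc_text)
    return sorted(a for a, ok in zebra.items() if ok and bgpd.get(a) == "invalid")


if __name__ == "__main__":
    print(find_mismatches(NHT_OUTPUT, BNC_OUTPUT))
```

Any address this returns is one where zebra has resolved the next hop but bgpd's next hop cache has not picked up the update, which is the symptom described here.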
Some debug logs:
Jan 27 13:00:48.776062 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:48 BGP: [ZWCSR-M7FG9] 11.0.0.1 [FSM] BGP_Stop (Active->Idle), fd -1
Jan 27 13:00:48.776104 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:48 BGP: [ZQHFG-DQGX1] 11.0.0.1 went from Active to Idle
Jan 27 13:00:56.441204 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:56 BGP: [ZQTB5-H8522] 11.0.0.1 [FSM] Timer (start timer expire).
Jan 27 13:00:56.441262 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:56 BGP: [ZWCSR-M7FG9] 11.0.0.1 [FSM] BGP_Start (Idle->Connect), fd -1
Jan 27 13:00:56.441262 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:56 BGP: [J47J0-K06GG] Found existing bnc 11.0.0.1/32(VRF default) flags 0xa ifindex 0 #paths 0 peer 0x7f2ee82e8010
Jan 27 13:00:56.441262 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:56 BGP: [T72VK-55DVG] 11.0.0.1 [FSM] Waiting for NHT
Jan 27 13:00:56.441299 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:56 BGP: [ZQHFG-DQGX1] 11.0.0.1 went from Idle to Connect
Jan 27 13:00:56.441630 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:56 BGP: [ZWCSR-M7FG9] 11.0.0.1 [FSM] TCP_connection_open_failed (Connect->Active), fd -1
Jan 27 13:00:56.441666 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:56 BGP: [ZQHFG-DQGX1] 11.0.0.1 went from Connect to Active
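For longer captures, the FSM transitions can be pulled out of the debug log with a small script. This is a rough sketch that assumes the log line layout shown above (`<peer> [FSM] <event> (<old>-><new>), fd <n>`); the helper name `fsm_transitions` is hypothetical, and lines such as "Waiting for NHT" that carry no state transition are skipped on purpose.

```python
import re

# Three of the log lines quoted above, with the syslog prefix shortened.
SAMPLE_LOG = """\
bgpd 2023/01/27 13:00:48 BGP: [ZWCSR-M7FG9] 11.0.0.1 [FSM] BGP_Stop (Active->Idle), fd -1
bgpd 2023/01/27 13:00:56 BGP: [ZQTB5-H8522] 11.0.0.1 [FSM] Timer (start timer expire).
bgpd 2023/01/27 13:00:56 BGP: [ZWCSR-M7FG9] 11.0.0.1 [FSM] BGP_Start (Idle->Connect), fd -1
bgpd 2023/01/27 13:00:56 BGP: [T72VK-55DVG] 11.0.0.1 [FSM] Waiting for NHT
bgpd 2023/01/27 13:00:56 BGP: [ZWCSR-M7FG9] 11.0.0.1 [FSM] TCP_connection_open_failed (Connect->Active), fd -1
"""

# Matches only lines that carry an "(Old->New), fd N" transition.
FSM_RE = re.compile(
    r"(?P<peer>\S+) \[FSM\] (?P<event>\w+) "
    r"\((?P<old>\w+)->(?P<new>\w+)\), fd (?P<fd>-?\d+)"
)


def fsm_transitions(log_text):
    """Return (peer, event, old_state, new_state, fd) for each transition line."""
    transitions = []
    for line in log_text.splitlines():
        m = FSM_RE.search(line)
        if m:
            transitions.append(
                (m["peer"], m["event"], m["old"], m["new"], int(m["fd"]))
            )
    return transitions


if __name__ == "__main__":
    for peer, event, old, new, fd in fsm_transitions(SAMPLE_LOG):
        print(f"{peer}: {old} -> {new} on {event} (fd {fd})")
```

On the sample above this yields the Active -> Idle -> Connect -> Active loop, with fd staying at -1 on every transition, matching the logs, where the Connect attempt fails right after "Waiting for NHT".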
Expected behavior
Our test expects FRR to enter Connect state and send a TCP SYN packet through one of the next hops of the default route.
This test passes when zebra is started with -M fpm, but fails with the new FPM plugin (-M dplane_fpm_nl) combined with the offload notification option (--asic-offload=notify_on_offload) that we recently integrated into SONiC OS.
Screenshots
Versions
- OS Version: SONiC OS (Debian 11 based)
- Kernel: 5.10.0-18-2-amd64
- FRR Version: FRR 8.2.2 with patches from 8.4 (https://github.com/stepanblyschak/sonic-buildimage/tree/bgp-suppress-fib-pending/src/sonic-frr/patch)
Additional context