BGP neighbor stuck in Active state (TCP_connection_open_failed) #12706

@stepanblyschak

Describe the bug

  • Did you check if this is a duplicate issue?
  • Did you test it on the latest FRRouting/frr master branch?

To Reproduce

  1. Run zebra with -M dplane_fpm_nl --asic-offload=notify_on_offload. (The issue does not reproduce on the same version with the old FPM plugin, -M fpm.)
  2. Set ip nht resolve-via-default.
  3. Configure a neighbor that should be resolved via the default route:
neighbor 11.0.0.1 remote-as 65100
neighbor 11.0.0.1 peer-group BGPMON
neighbor 11.0.0.1 description mon

The full FRR configuration is attached as config.txt.
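
To make the intent of step 2 concrete, here is a minimal sketch of the lookup that nexthop tracking is expected to perform for 11.0.0.1. It is a simplified model and not FRR code: the function name `resolve_nexthop` and the two non-default prefixes are hypothetical, chosen only so that nothing except 0.0.0.0/0 covers the peer address. The one behavior it models is that the default route is eligible for resolution only when `ip nht resolve-via-default` is set.

```python
import ipaddress


def resolve_nexthop(addr, routes, resolve_via_default):
    """Return the longest-prefix route covering addr, or None.

    Simplified model: the default route (prefix length 0) is only
    eligible when resolve_via_default is True.
    """
    ip = ipaddress.ip_address(addr)
    best = None
    for prefix in routes:
        net = ipaddress.ip_network(prefix)
        if ip not in net:
            continue
        if net.prefixlen == 0 and not resolve_via_default:
            continue  # default route excluded from resolution
        if best is None or net.prefixlen > best.prefixlen:
            best = net
    return best


# Hypothetical routing table: only the default route covers 11.0.0.1.
routes = ["0.0.0.0/0", "10.0.0.0/31", "10.1.0.32/32"]

print(resolve_nexthop("11.0.0.1", routes, resolve_via_default=True))   # 0.0.0.0/0
print(resolve_nexthop("11.0.0.1", routes, resolve_via_default=False))  # None
```

With the option enabled, the peer address should resolve via the default route, which is what the rest of this report expects bgpd to act on.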

The session with 11.0.0.1 is stuck in Active state:

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
11.0.0.1        4      65100         0         0        0    0    0    never       Active        0 mon

The neighbor output reports "Waiting for NHT" as the reason for the last reset:

r-tigon-21# show ip bgp neighbors 11.0.0.1
BGP neighbor is 11.0.0.1, remote AS 65100, local AS 65100, internal link
 Description: mon
 Member of peer-group BGPMON for session parameters
  BGP version 4, remote router ID 0.0.0.0, local router ID 10.1.0.32
  BGP state = Active
  Last read 00:05:27, Last write never
  Hold time is 180, keepalive interval is 60 seconds
  Graceful restart information:
    Local GR Mode: Helper*
    Remote GR Mode: NotApplicable
    R bit: False
    Timers:
      Configured Restart Time(sec): 120
      Received Restart Time(sec): 0
  Message statistics:
    Inq depth is 0
    Outq depth is 0
                         Sent       Rcvd
    Opens:                  0          0
    Notifications:          0          0
    Updates:                0          0
    Keepalives:             0          0
    Route Refresh:          0          0
    Capability:             0          0
    Total:                  0          0
  Minimum time between advertisement runs is 0 seconds
  Update source is 10.1.0.32

 For address family: IPv4 Unicast
  BGPMON peer-group member
  Not part of any update group
  Community attribute sent to this neighbor(all)
  Inbound path policy configured
  Outbound path policy configured
  Route map for incoming advertisements is *FROM_BGPMON
  Route map for outgoing advertisements is *TO_BGPMON
  0 accepted prefixes
  Maximum prefixes allowed 1
  Threshold for warning message 75%

 For address family: IPv6 Unicast
  BGPMON peer-group member
  Not part of any update group
  Community attribute sent to this neighbor(all)
  0 accepted prefixes

  Connections established 0; dropped 0
  Last reset 00:05:27,  Waiting for NHT
BGP Connect Retry Timer in Seconds: 120
Next connect timer due in 33 seconds
Read thread: off  Write thread: off  FD used: -1

The default route is learned and installed in the FIB:

r-tigon-21# show ip route 0.0.0.0/0
Routing entry for 0.0.0.0/0
  Known via "bgp", distance 20, metric 0, best
  Last update 00:09:52 ago
  * 10.0.0.1, via PortChannel102, weight 1
  * 10.0.0.5, via PortChannel105, weight 1
  * 10.0.0.9, via PortChannel108, weight 1
  * 10.0.0.13, via PortChannel1011, weight 1

show ip nht says 11.0.0.1 is resolved:

11.0.0.1
 resolved via bgp
 via 10.0.0.1, PortChannel102
 via 10.0.0.5, PortChannel105
 via 10.0.0.9, PortChannel108
 via 10.0.0.13, PortChannel1011
 Client list: bgp(fd 39)

show bgp nexthop:

 11.0.0.1 invalid, #paths 0, peer 11.0.0.1
  Last update: Mon Jan 30 14:31:59 2023

It is not clear why bgpd marks the nexthop as invalid while zebra reports it as resolved.
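
The disagreement between the two daemons can be checked mechanically. The sketch below parses the two outputs quoted above and flags the case where zebra reports the address as resolved while bgpd lists it as invalid. The helper names `zebra_resolved` and `bgpd_valid` are hypothetical, and the parsing assumes the text layout shown in this issue (address at the start of a line, details indented), which may differ between FRR versions.

```python
import re

# Outputs copied from this issue.
NHT_OUTPUT = """11.0.0.1
 resolved via bgp
 via 10.0.0.1, PortChannel102
 via 10.0.0.5, PortChannel105
 via 10.0.0.9, PortChannel108
 via 10.0.0.13, PortChannel1011
 Client list: bgp(fd 39)
"""

BGP_NEXTHOP_OUTPUT = """ 11.0.0.1 invalid, #paths 0, peer 11.0.0.1
  Last update: Mon Jan 30 14:31:59 2023
"""


def zebra_resolved(nht_text, addr):
    """True if the `show ip nht` entry for addr has a 'resolved via' line."""
    lines = nht_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() != addr:
            continue
        for body in lines[i + 1:]:
            if body and not body.startswith(" "):
                break  # next entry starts at column 0
            if body.strip().startswith("resolved via"):
                return True
        return False
    return False


def bgpd_valid(nexthop_text, addr):
    """True/False for valid/invalid in `show bgp nexthop`; None if absent."""
    for line in nexthop_text.splitlines():
        m = re.match(r"\s*(\S+)\s+(valid|invalid)\b", line)
        if m and m.group(1) == addr:
            return m.group(2) == "valid"
    return None


peer = "11.0.0.1"
if zebra_resolved(NHT_OUTPUT, peer) and bgpd_valid(BGP_NEXTHOP_OUTPUT, peer) is False:
    print(f"{peer}: zebra says resolved, bgpd says invalid")
```

Run against the outputs in this report, the check prints the mismatch line, which matches the symptom described here.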

Some debug logs:

Jan 27 13:00:48.776062 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:48 BGP: [ZWCSR-M7FG9] 11.0.0.1 [FSM] BGP_Stop (Active->Idle), fd -1
Jan 27 13:00:48.776104 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:48 BGP: [ZQHFG-DQGX1] 11.0.0.1 went from Active to Idle
Jan 27 13:00:56.441204 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:56 BGP: [ZQTB5-H8522] 11.0.0.1 [FSM] Timer (start timer expire).
Jan 27 13:00:56.441262 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:56 BGP: [ZWCSR-M7FG9] 11.0.0.1 [FSM] BGP_Start (Idle->Connect), fd -1
Jan 27 13:00:56.441262 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:56 BGP: [J47J0-K06GG] Found existing bnc 11.0.0.1/32(VRF default) flags 0xa ifindex 0 #paths 0 peer 0x7f2ee82e8010
Jan 27 13:00:56.441262 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:56 BGP: [T72VK-55DVG] 11.0.0.1 [FSM] Waiting for NHT
Jan 27 13:00:56.441299 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:56 BGP: [ZQHFG-DQGX1] 11.0.0.1 went from Idle to Connect
Jan 27 13:00:56.441630 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:56 BGP: [ZWCSR-M7FG9] 11.0.0.1 [FSM] TCP_connection_open_failed (Connect->Active), fd -1
Jan 27 13:00:56.441666 r-leopard-58 INFO bgp#supervisord: bgpd 2023/01/27 13:00:56 BGP: [ZQHFG-DQGX1] 11.0.0.1 went from Connect to Active
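
The "Found existing bnc" line carries `flags 0xa`, which may help narrow down why the nexthop is treated as invalid. The sketch below decodes that value. The bit names are an assumption taken from my reading of the `BGP_NEXTHOP_*` defines in bgpd/bgp_nexthop.h and should be verified against the FRR tree actually in use; `decode_bnc_flags` is a hypothetical helper, not part of FRR.

```python
import re

# ASSUMPTION: bit positions as in bgpd/bgp_nexthop.h (BGP_NEXTHOP_* defines).
# Verify against the FRR version you are running before relying on this.
BNC_FLAGS = {
    0x01: "VALID",
    0x02: "REGISTERED",
    0x04: "CONNECTED",
    0x08: "PEER_NOTIFIED",
}


def decode_bnc_flags(log_line):
    """Extract 'flags 0x..' from a bgpd NHT debug line and name the set bits."""
    m = re.search(r"flags (0x[0-9a-fA-F]+)", log_line)
    if not m:
        return None
    value = int(m.group(1), 16)
    return [name for bit, name in BNC_FLAGS.items() if value & bit]


# Message text copied from the debug log in this issue.
line = ("Found existing bnc 11.0.0.1/32(VRF default) flags 0xa "
        "ifindex 0 #paths 0 peer 0x7f2ee82e8010")

print(decode_bnc_flags(line))  # ['REGISTERED', 'PEER_NOTIFIED']
```

If those bit names hold, 0xa means the nexthop is registered with zebra and the peer has been notified, but the valid bit is not set, which would be consistent with the "invalid" status in show bgp nexthop.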

Expected behavior

Our test expects FRR to enter the Connect state and send a TCP SYN packet through one of the next hops of the default route.
The test passes when zebra is started with -M fpm. It fails with the new FPM plugin, -M dplane_fpm_nl, combined with the offload notification option --asic-offload=notify_on_offload, which we recently integrated into SONiC OS.

Screenshots

Versions

  • OS Version: SONiC OS (Debian 11 based)
  • Kernel: 5.10.0-18-2-amd64

Additional context

Labels: triage (Needs further investigation)
