Description
Describe the bug
- Did you check if this is a duplicate issue?
- Did you test it on the latest FRRouting/frr master branch?
FRR is set up to peer with Arista switches and exchange IPv4 routes over IPv6 link-local BGP sessions (RFC 5549).
This intermittently stops working, mostly after FRR is restarted. When it happens, I see the following:
BGP neighbor output no longer shows the extended-nexthop capability as advertised
Neighbor output displays that extended nexthop is received but not advertised:
client1rt# show bgp neighbors fabric0
BGP neighbor on fabric0: fe80::d6af:f7ff:fe91:46db, remote AS 4209900005, local AS 4209901001, external link
Member of peer-group EVPN-FABRIC for session parameters
BGP version 4, remote router ID 10.60.196.15, local router ID 10.60.197.1
BGP state = Established, up for 00:00:15
Last read 00:00:15, Last write 00:00:13
Hold time is 180, keepalive interval is 60 seconds
Neighbor capabilities:
4 Byte AS: advertised and received
Extended Message: advertised
AddPath:
IPv4 Unicast: RX advertised and received
Extended nexthop: received <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Address families by peer:
IPv4 Unicast
Long-lived Graceful Restart: advertised
Route refresh: advertised and received(new)
Enhanced Route Refresh: advertised and received
Address Family IPv4 Unicast: advertised and received
Hostname Capability: advertised (name: client1rt,domain name: n/a) not received
Graceful Restart Capability: advertised and received
Remote Restart timer is 300 seconds
Address families by peer:
none
Graceful restart information:
End-of-RIB send: IPv4 Unicast
End-of-RIB received: IPv4 Unicast
Local GR Mode: Helper*
Remote GR Mode: Helper
R bit: False
Timers:
Configured Restart Time(sec): 120
Received Restart Time(sec): 300
IPv4 Unicast:
F bit: False
End-of-RIB sent: Yes
End-of-RIB sent after update: Yes
End-of-RIB received: Yes
Timers:
Configured Stale Path Time(sec): 360
Message statistics:
Inq depth is 0
Outq depth is 0
Sent Rcvd
Opens: 2 2
Notifications: 2 0
Updates: 6 20
Keepalives: 4 7
Route Refresh: 0 0
Capability: 0 0
Total: 14 29
Minimum time between advertisement runs is 0 seconds
For address family: IPv4 Unicast
EVPN-FABRIC peer-group member
Update group 1, subgroup 1
Packet Queue length 0
Inbound soft reconfiguration allowed
Community attribute sent to this neighbor(all)
Inbound path policy configured
Outbound path policy configured
Route map for incoming advertisements is *PERMIT-ANY
Route map for outgoing advertisements is *LOCAL-LOOPBACKS
0 accepted prefixes
Maximum prefixes allowed 10000
Threshold for warning message 75%
Connections established 2; dropped 1
Last reset 00:00:16, No AFI/SAFI activated for peer
Local host: fe80::a236:9fff:fe3e:509a, Local port: 179
Foreign host: fe80::d6af:f7ff:fe91:46db, Foreign port: 46323
Nexthop: 10.60.197.1
Nexthop global: fe80::a236:9fff:fe3e:509a
Nexthop local: fe80::a236:9fff:fe3e:509a
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Peer Authentication Enabled
Read thread: on Write thread: on FD used: 27
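The difference between the broken and the working session is only visible in the "Extended nexthop:" line of the neighbor output. A minimal sketch of checking that line automatically is shown here. It assumes the text layout printed by FRR 8.2.2 in this report, and the function name `extended_nexthop_state` is hypothetical. The text could be captured with `vtysh -c "show bgp neighbors fabric0"`.

```python
import re


def extended_nexthop_state(neighbor_output: str) -> dict:
    """Report whether extended nexthop is advertised and/or received.

    Assumes a line of the form "Extended nexthop: advertised and received",
    as printed in the neighbor output of this report.
    """
    for line in neighbor_output.splitlines():
        match = re.match(r"\s*Extended nexthop:\s*(.*)", line)
        if match:
            # Drop trailing annotations such as the "<<<<" markers used above.
            words = match.group(1).split("<")[0].split()
            return {
                "advertised": "advertised" in words,
                "received": "received" in words,
            }
    # Capability line not present at all.
    return {"advertised": False, "received": False}


if __name__ == "__main__":
    broken = "Neighbor capabilities:\nExtended nexthop: received\n"
    state = extended_nexthop_state(broken)
    if state["received"] and not state["advertised"]:
        print("extended nexthop is received but not advertised")
```

Running such a check after every FRR restart might help narrow down when the capability stops being advertised.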
Also, in the BGP router config each affected neighbor suddenly has a "no neighbor fabric0 capability extended-nexthop" line,
even though the capability is enabled through the peer-group and I never configured the negation. The line simply appears for all neighbors in the peer group.
router bgp 4209901001
no bgp ebgp-requires-policy
no bgp default ipv4-unicast
bgp bestpath as-path multipath-relax
neighbor EVPN-FABRIC peer-group
neighbor EVPN-FABRIC password XXX
neighbor EVPN-FABRIC capability extended-nexthop
neighbor EVPN-OVERLAY-PEERS peer-group
neighbor EVPN-OVERLAY-PEERS bfd
neighbor EVPN-OVERLAY-PEERS bfd profile EVPN-FABRIC
neighbor EVPN-OVERLAY-PEERS password XXX
neighbor EVPN-OVERLAY-PEERS ebgp-multihop 3
neighbor EVPN-OVERLAY-PEERS update-source evpn0
neighbor fabric0 interface peer-group EVPN-FABRIC
neighbor fabric0 remote-as 4209900005
no neighbor fabric0 capability extended-nexthop <<<<<<<<<<<<<<<<<<<<<<<<<<
neighbor fabric2 interface peer-group EVPN-FABRIC
neighbor fabric2 remote-as 4209900006
no neighbor fabric2 capability extended-nexthop <<<<<<<<<<<<<<<<<<<<<<<<<<
neighbor 10.60.196.15 remote-as 4209900005
neighbor 10.60.196.15 peer-group EVPN-OVERLAY-PEERS
!
address-family ipv4 unicast
redistribute connected route-map LOOPBACK-HOSTIPS
neighbor EVPN-FABRIC activate
neighbor EVPN-FABRIC soft-reconfiguration inbound
neighbor EVPN-FABRIC maximum-prefix 10000
neighbor EVPN-FABRIC route-map PERMIT-ANY in
neighbor EVPN-FABRIC route-map LOCAL-LOOPBACKS out
maximum-paths 2
exit-address-family
Even when I reactivate the capability with "neighbor fabric0 capability extended-nexthop", the problem persists after the BGP session is reset. Some combination of restarting FRR and changing the configuration eventually fixes it, but I can't make out a pattern.
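To spot the unexpected per-neighbor negation quickly, the running config can be scanned for neighbors that carry a "no neighbor ... capability extended-nexthop" line while their peer-group enables the capability. The following is a rough sketch, assuming the config text looks like the excerpt above; the function name `find_capability_overrides` is hypothetical and not part of FRR.

```python
import re


def find_capability_overrides(config: str,
                              capability: str = "capability extended-nexthop"):
    """Return (neighbor, peer_group) pairs where the neighbor negates a
    capability that its peer-group enables."""
    enabled = set()    # names with "neighbor NAME capability ..."
    member_of = {}     # neighbor -> peer-group it belongs to
    negated = []       # neighbors with "no neighbor NAME capability ..."

    for raw in config.splitlines():
        # Ignore trailing "<<<<" annotations like the ones in this report.
        line = raw.split("<<")[0].strip()

        m = re.match(r"neighbor (\S+) (?:interface )?peer-group (\S+)$", line)
        if m:
            member_of[m.group(1)] = m.group(2)
            continue

        m = re.match(r"(no )?neighbor (\S+) " + re.escape(capability) + "$", line)
        if m:
            if m.group(1):
                negated.append(m.group(2))
            else:
                enabled.add(m.group(2))

    return [(n, member_of[n]) for n in negated if member_of.get(n) in enabled]


if __name__ == "__main__":
    sample = """\
neighbor EVPN-FABRIC peer-group
neighbor EVPN-FABRIC capability extended-nexthop
neighbor fabric0 interface peer-group EVPN-FABRIC
no neighbor fabric0 capability extended-nexthop
"""
    for neighbor, group in find_capability_overrides(sample):
        print(f"{neighbor} negates a capability enabled on {group}")
```

Comparing the result before and after an FRR restart could show whether the negation is written at startup or only appears later.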
The Arista side logs an error when the BGP session is established and also shows routes with the wrong next hop:
Apr 27 16:26:26 leaf1 Bgp: %BGP-3-DROP_TXUPDATE: Dropped updates for peer fe80::a236:9fff:fe3e:509a%Et2 (VRF default AS 4209901001) because a local Nexthop was not configured for AFI/SAFI IPv4/Unicast (message repeated 2 times in 78.1729 secs)
#show bgp neighbors fe80::a236:9fff:fe3e:509a%Et2 ipv4 unicast received-routes
BGP routing table information for VRF default
Router identifier 10.60.196.15, local AS number 4209900005
Route status codes: s - suppressed, * - valid, > - active, E - ECMP head, e - ECMP
S - Stale, c - Contributing to ECMP, b - backup, L - labeled-unicast
% - Pending BGP convergence
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI Origin Validation codes: V - valid, I - invalid, U - unknown
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop
Network Next Hop Metric AIGP LocPref Weight Path
10.60.197.1/32 10.60.197.1 0 - - - 4209901001 ?
10.61.197.1/32 10.60.197.1 0 - - - 4209901001 ?
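The symptom on the Arista side is that IPv4 prefixes arrive with an IPv4 next hop instead of the IPv6 link-local one. A small sketch of detecting this in captured received-routes text is given here. It assumes the column layout shown above (prefix first, next hop second, optional "*" and ">" status flags in front), and the function name `ipv4_routes_with_ipv4_nexthop` is hypothetical.

```python
import ipaddress


def ipv4_routes_with_ipv4_nexthop(received_routes: str):
    """Return (prefix, next_hop) pairs for IPv4 routes whose next hop is IPv4."""
    suspicious = []
    for line in received_routes.splitlines():
        # Drop the "*" and ">" status flags that precede valid/active routes.
        fields = [f for f in line.split() if f not in ("*", ">")]
        if len(fields) < 2:
            continue
        try:
            prefix = ipaddress.ip_network(fields[0], strict=False)
            # Strip an interface scope such as "%Et2" before parsing.
            next_hop = ipaddress.ip_address(fields[1].split("%")[0])
        except ValueError:
            # Header, legend, or any other non-route line.
            continue
        if prefix.version == 4 and next_hop.version == 4:
            suspicious.append((str(prefix), str(next_hop)))
    return suspicious


if __name__ == "__main__":
    sample = "10.60.197.1/32 10.60.197.1 0 - - - 4209901001 ?\n"
    for prefix, next_hop in ipv4_routes_with_ipv4_nexthop(sample):
        print(f"{prefix} has IPv4 next hop {next_hop}")
```

An empty result means every parsed IPv4 route carries an IPv6 next hop, which is the expected state for this setup.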
To Reproduce
I'm not sure how to reproduce this; the problem occurs fairly often, mostly after restarting FRR.
Expected behavior
When it works, the next hop is IPv6 on the Arista side, as expected:
#show bgp neighbors fe80::a236:9fff:fe3e:509a%Et2 ipv4 unicast received-routes
BGP routing table information for VRF default
Router identifier 10.60.196.15, local AS number 4209900005
Route status codes: s - suppressed, * - valid, > - active, E - ECMP head, e - ECMP
S - Stale, c - Contributing to ECMP, b - backup, L - labeled-unicast
% - Pending BGP convergence
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI Origin Validation codes: V - valid, I - invalid, U - unknown
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop
Network Next Hop Metric AIGP LocPref Weight Path
* > 10.60.197.1/32 fe80::a236:9fff:fe3e:509a%Et2 0 - - - 4209901001 ?
* > 10.61.197.1/32 fe80::a236:9fff:fe3e:509a%Et2 0 - - - 4209901001 ?
In FRR, the extended nexthop capability is then also shown as advertised:
client1rt# show bgp neighbors fabric0
BGP neighbor on fabric0: fe80::d6af:f7ff:fe91:46db, remote AS 4209900005, local AS 4209901001, external link
Member of peer-group EVPN-FABRIC for session parameters
BGP version 4, remote router ID 10.60.196.15, local router ID 10.60.197.1
BGP state = Established, up for 00:14:44
Last read 00:00:05, Last write 00:00:44
Hold time is 180, keepalive interval is 60 seconds
Neighbor capabilities:
4 Byte AS: advertised and received
Extended Message: advertised
AddPath:
IPv4 Unicast: RX advertised and received
Extended nexthop: advertised and received <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Address families by peer:
IPv4 Unicast
Long-lived Graceful Restart: advertised
Route refresh: advertised and received(new)
Enhanced Route Refresh: advertised and received
Address Family IPv4 Unicast: advertised and received
Hostname Capability: advertised (name: client1rt,domain name: n/a) not received
Graceful Restart Capability: advertised and received
Remote Restart timer is 300 seconds
Address families by peer:
none
Graceful restart information:
End-of-RIB send: IPv4 Unicast
End-of-RIB received: IPv4 Unicast
Local GR Mode: Helper*
Remote GR Mode: Helper
R bit: False
Timers:
Configured Restart Time(sec): 120
Received Restart Time(sec): 300
IPv4 Unicast:
F bit: False
End-of-RIB sent: Yes
End-of-RIB sent after update: Yes
End-of-RIB received: Yes
Timers:
Configured Stale Path Time(sec): 360
Message statistics:
Inq depth is 0
Outq depth is 0
Sent Rcvd
Opens: 1 1
Notifications: 0 0
Updates: 3 12
Keepalives: 15 19
Route Refresh: 0 0
Capability: 0 0
Total: 19 32
Minimum time between advertisement runs is 0 seconds
For address family: IPv4 Unicast
EVPN-FABRIC peer-group member
Update group 1, subgroup 1
Packet Queue length 0
Inbound soft reconfiguration allowed
Community attribute sent to this neighbor(all)
Inbound path policy configured
Outbound path policy configured
Route map for incoming advertisements is *PERMIT-ANY
Route map for outgoing advertisements is *LOCAL-LOOPBACKS
17 accepted prefixes
Maximum prefixes allowed 10000
Threshold for warning message 75%
Connections established 1; dropped 0
Last reset 00:15:53, Waiting for peer OPEN
Local host: fe80::a236:9fff:fe3e:509a, Local port: 41550
Foreign host: fe80::d6af:f7ff:fe91:46db, Foreign port: 179
Nexthop: 10.60.197.1
Nexthop global: fe80::a236:9fff:fe3e:509a
Nexthop local: fe80::a236:9fff:fe3e:509a
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Estimated round trip time: 1 ms
Peer Authentication Enabled
Read thread: on Write thread: on FD used: 29
Versions
- OS Version: Debian Bullseye (11.3)
- Kernel: Linux client1rt 5.10.0-12-amd64 #1 SMP Debian 5.10.103-1 (2022-03-07) x86_64 GNU/Linux
- FRR Version: 8.2.2-0~deb11u1 (from FRR deb repository)