Skip to content

R18 again#638

Merged
taspelund merged 7 commits into
mainfrom
trey/r18-again
Feb 11, 2026
Merged

R18 again#638
taspelund merged 7 commits into
mainfrom
trey/r18-again

Conversation

@taspelund

@taspelund taspelund commented Feb 11, 2026

Copy link
Copy Markdown
Contributor

Couple important fixes for R18 plus a couple mgadm quality of life fixes.

Important fixes:

  • add handling for ErrorKind::WouldBlock in recv_msg() to avoid tearing down TCP session when a read returns EWOULDBLOCK/EAGAIN for a non-blocking socket (this fixes large update message regression #634) -- when moving the TCP socket to non-blocking, this read call was missed when adding error handling for WouldBlock.
  • fix RA router lifetime (arista workaround) update path -- NDP manager interface config would be updated, but the tx loop would never restart or re-read the config. Now it stores the lifetime in an AtomicU16 that can be updated without restarting the tx loop.

mgadm:

  • replace humantime duration formatting with more readable formatter
  • re-order ASN to final positional arg in NDP commands so environ works

Other:

  • add validation of rx BGP message length
    - set TCP_NODELAY on BGP sockets

details for all changes are in the individual commit messages

Updates to BGP unnumbered peers that change the Router Lifetime field of
the NDP Router Advertisement (via mgadm --act-as-default-router arg)
would propagate through the DB and into the NDP interface state, but the
tx thread had no way of picking up the new value without restarting...
and there was no logic to restart the tx thread. This swaps out the u16
holding the Router Lifetime for an AtomicU16 and adds logic to do an
atomic store so the tx thread can use the new value without requiring a
restart.
recv_header handles WouldBlock from the socket read timeout by retrying,
but recv_msg uses read_exact for the message body, which propagates
WouldBlock as a fatal IO error. With large BGP UPDATE messages the
body may not arrive within a single timeout window, causing a spurious
TcpConnectionFails.

Replace read_exact with an incremental retry loop. Introduce
RecvError::Shutdown to cleanly separate dropped-flag exits from real IO
errors, removing the defensive dropped check from the IO error path.

Fixes: #634
Header::from_wire does not check that the length field is between 19
and 4096. This means recv_msg accepts oversized headers and attempts to
allocate and read arbitrarily large message bodies. Undersized lengths
would underflow the body size calculation.

Add length bounds checks in recv_msg after header parsing. Out-of-range
lengths produce a HeaderParseError that the FSM translates into
a NOTIFICATION with Bad Message Length, per RFC 4271 section 4.1.
Validation is intentionally in the connection layer rather than
from_wire so the session can send the NOTIFICATION before closing.
@taspelund taspelund self-assigned this Feb 11, 2026
@taspelund taspelund added Bug bgp Border Gateway Protocol mgd Maghemite daemon customer For any bug reports or feature requests tied to customer requests labels Feb 11, 2026
Comment thread bgp/src/connection_tcp.rs Outdated
direction: ConnectionDirection,
config: &SessionInfo,
) -> Result<Self, Error> {
conn.set_nodelay(true)?;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what prompted this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nagle came to mind when thinking through potential causes of latency and I noticed we didn't disable it. There isn't a specific issue motivating this, just that in modern deployments there's not much point in leaving it enabled.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverting for this PR. Will look at re-adding after R18 is cut

@taspelund taspelund merged commit 937fdab into main Feb 11, 2026
15 checks passed
@taspelund taspelund deleted the trey/r18-again branch February 11, 2026 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bgp Border Gateway Protocol Bug customer For any bug reports or feature requests tied to customer requests mgd Maghemite daemon

Projects

None yet

Development

Successfully merging this pull request may close these issues.

large update message regression

2 participants