
[EXPERIMENT] pull-mode flooding#2432

Closed
graydon wants to merge 3 commits into stellar:master from graydon:better-flooding

Conversation

Contributor

graydon commented Feb 11, 2020

Pull-mode flooding

High-level explanation of changes:

The change aims to reduce the amount of redundant flood-type traffic (that is, a message received by a peer along multiple connections). It does so in a very conventional fashion: switching from "eager" to "lazy" flooding.

Specifically, flood-type traffic is no longer propagated to neighbours as soon as it's received; rather, it's advertised using much smaller, denser batches of short (64-bit SipHash-2-4) keyed hash codes. Advertisements are sent fairly often, but a slight buffering delay allows a decent number of them to be combined into a single message. When an advertisement is received, the receiver immediately responds with a demand for all the hashes it hasn't yet seen, which the sender then immediately sends using the existing message types. Since advertisements are typically less than a tenth the size of the messages they advertise, this reduces the total traffic accordingly.
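The advertise/demand exchange above can be sketched roughly as follows (Python used for brevity; the class and handler names here are illustrative, not the PR's actual C++ API):

```python
# Minimal sketch of the lazy-flood advert/demand exchange (illustrative names,
# not the PR's actual code).

class Peer:
    def __init__(self, name):
        self.name = name
        self.seen = {}            # short hash -> full message
        self.pending_advert = []  # outgoing advert codes, flushed in batches

    def receive_message(self, short_hash, msg):
        self.seen[short_hash] = msg
        self.pending_advert.append(short_hash)  # advertise onward, lazily

    def on_flood_advert(self, sender, hashes):
        # Demand only the hashes we haven't seen yet.
        missing = [h for h in hashes if h not in self.seen]
        if missing:
            sender.on_flood_demand(self, missing)

    def on_flood_demand(self, demander, hashes):
        # Fulfil the demand with the full messages, via the existing types.
        for h in hashes:
            if h in self.seen:
                demander.receive_message(h, self.seen[h])

a, b = Peer("a"), Peer("b")
a.receive_message(0x1234, "tx-payload")
b.on_flood_advert(a, a.pending_advert)  # b demands 0x1234, a sends it
print(b.seen[0x1234])                   # prints "tx-payload"
```

The bandwidth win comes from the common case where `missing` is empty: a peer that already has the message costs only its 8-byte advert code, not a full retransmission.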

This change only modifies the flow of data between the floodgate and the IO queues -- data we've already committed to informing our neighbours about. It is thus fairly narrow and late in the full set of system queueing disciplines, and does not change most decisions around tx-queue throttling, surge-pricing, SCP rebroadcasting, etc.

It also only modifies flood traffic -- fetch traffic is unaffected.

It does add 2 WAN RTTs to flood traffic latency, and adds two possible attack vectors (that I am aware of):

  • a simple/dumb cache-poisoning vector where an attacker advertises a message but then doesn't provide it when it's demanded. This is mitigated by making demands short-lived with a timeout, and allowing peers to race when re-advertising a hash whose pending demand has timed out.
  • a fancier cryptographic cache-poisoning vector where an attacker could send a message that collides with a target hash, inhibiting its peer from even demanding it from anyone (i.e. it thinks it already has message X when it really has Y, but hash(X) == hash(Y)). The hashes are keyed by LCL so this only works if an attacker can both collide a SipHash-2-4 in 5 seconds and get their packet accepted before the target packet, so the odds seem tolerable to me, but it's worth considering (there are other options here including just using SHA256s, as I did in an earlier iteration, but the SipHash variant uses 1/4 as much bandwidth, and this is all about reducing bandwidth.)
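The per-ledger keying can be illustrated like so (not the PR's code: Python's stdlib exposes no keyed SipHash, so keyed BLAKE2b stands in for SipHash-2-4; the point is the key rotation, not the hash choice):

```python
import hashlib

# Illustration of per-ledger keyed 64-bit short hashes. The PR keys
# SipHash-2-4 with the low 128 bits of the LCL hash; keyed BLAKE2b is
# substituted here purely for demonstration.

def short_hash(msg: bytes, lcl_hash: bytes) -> bytes:
    key = lcl_hash[:16]  # low 128 bits of the last-closed-ledger hash
    return hashlib.blake2b(msg, key=key, digest_size=8).digest()

msg = b"some flood message"
h1 = short_hash(msg, b"\x01" * 32)  # ledger N
h2 = short_hash(msg, b"\x02" * 32)  # ledger N+1: same message, new code
assert h1 != h2
```

Because the key changes every ledger, a collision precomputed against one key is useless once the ledger closes, which is what confines the attacker to the ~5-second window described above.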

The change can coexist with existing (eager) connections that do not yet support lazy flooding, and in fact can decide message-by-message which mode to use, so one way to further mitigate both the cache-poisoning vectors and any potential latency problems is to vary the entire flooding strategy stochastically (e.g. send 75% of traffic lazily and 25% eagerly). My plan is to do this and make it a config option, because it's a simple dial to understand and also allows the operator to just turn it off if it's misbehaving.
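The stochastic dial amounts to a single per-message coin flip (sketch only; the config name is hypothetical, not one the PR defines):

```python
import random

# LAZY_FLOOD_FRACTION is a hypothetical operator-tunable config knob:
# 1.0 = always lazy, 0.0 = lazy flooding off entirely.
LAZY_FLOOD_FRACTION = 0.75

def choose_mode(rng=random.random):
    # Per-message decision: lazy (advertise/demand) or eager (send now).
    return "lazy" if rng() < LAZY_FLOOD_FRACTION else "eager"
```

Setting the fraction to 0.0 recovers today's behaviour exactly, which is what makes this a safe escape hatch for operators.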

A more sophisticated strategy could opt to set a per-link policy to construct (e.g.) an eager spanning tree with a backup lazy mesh (as in Plumtree) but this has potentially more complex/fragile dynamics so I am deferring it for now.

As a "prior art" / comparing-notes aside: this is similar to Bitcoin Core's "compact block" / low-bandwidth mode BIP-152 except they use even smaller 6-byte truncated SipHash-2-4s (again keyed with the previous block hash).

Detailed explanation of changes:

  • The overlay protocol gets 2 new message types:

    • FLOOD_ADVERT: a list of short hashes available to be sent
    • FLOOD_DEMAND: a list of short hashes requested to be sent
  • Each Peer object gets 2 new fields:

    • A "pending" FLOOD_ADVERT, which accumulates outgoing 64-bit advert-codes (until it's ready to be flushed)
    • A timer
  • The Peer flushes its pending advert (sending it to its remote peer) on the earliest of 3 conditions:

    • The IO loop finishes processing an incoming (large) read buffer from the remote
    • The timer expires (every 100ms)
    • A threshold of pending advert-codes is reached (1024)
  • Floodgate::FloodRecord gets 2 new fields:

    • a short hash of the message keyed by the low 128 bits of the LCL to use when in sync
    • a short hash of the message keyed by all zeroes to use when out of sync
  • The FloodGate gets 2 new maps:

    • An index from short hashes to the existing FloodRecords in mFloodMap
    • A map from demanded (but not yet received) short hashes to the ledger number at which they were demanded, used to suppress duplicate simultaneous demands while still allowing a re-demand if a demand goes unfulfilled.
  • The implementations of FloodGate::{addRecord,broadcast} are partly merged into a new FloodGate::insert method, and broadcast is modified to push an advert into a Peer's pending FLOOD_ADVERT when possible rather than sending a full eager message.

  • FloodGate and OverlayManager get 2 new methods:

    • demandMissing, called in response to receiving FLOOD_ADVERT. Sends a FLOOD_DEMAND for all the not-yet-known messages in the advert.
    • fulfillDemand, called in response to receiving a FLOOD_DEMAND. Sends all the messages that were demanded.
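The demand-suppression map above, together with the timeout that mitigates the advertise-then-withhold attack, can be sketched as follows (illustrative names and a hypothetical ledger-based TTL, not the PR's actual code):

```python
# Sketch of the FloodGate's demand-suppression map. Pending demands are
# stamped with the ledger number at which they were issued; a demand older
# than DEMAND_TTL_LEDGERS may be re-issued to another peer.

DEMAND_TTL_LEDGERS = 2  # hypothetical timeout, measured in ledgers

class FloodGate:
    def __init__(self):
        self.known = set()         # short hashes of messages we already hold
        self.pending_demands = {}  # short hash -> ledger it was demanded at

    def demand_missing(self, advert, current_ledger):
        """Return the subset of advertised hashes worth demanding now."""
        to_demand = []
        for h in advert:
            if h in self.known:
                continue  # already have the full message
            issued = self.pending_demands.get(h)
            if issued is not None and current_ledger - issued < DEMAND_TTL_LEDGERS:
                continue  # a live demand is already outstanding elsewhere
            self.pending_demands[h] = current_ledger
            to_demand.append(h)
        return to_demand

fg = FloodGate()
assert fg.demand_missing([1, 2], current_ledger=10) == [1, 2]
assert fg.demand_missing([1, 2], current_ledger=10) == []  # suppressed
assert fg.demand_missing([1], current_ledger=13) == [1]    # timed out, re-demand
```

A withholding attacker can thus delay delivery of a message by at most one TTL before some other advertising peer is asked for it.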

Contributor Author

graydon commented Feb 17, 2020

Likely improvement: flush adverts to all peers after a given buffer-read rather than just the peer the read finished on.

graydon force-pushed the better-flooding branch 3 times, most recently from df69ba1 to 90a1bb5 on July 13, 2021
@MonsieurNicolas
Contributor

this was replaced by #3479
