p2p: revised router message scheduling#6126
Conversation
|
A basic benchmark against the e2e simple network shows it handling a load of 50-75 txs/block before degrading. Eventually, the network is only able to handle 15-20 txs/block. I'm not sure why, but I doubt it's p2p related. From this, we see that reading off of peer queues before sending on the TCP connection takes the most time. |
Codecov Report
@@ Coverage Diff @@
## master #6126 +/- ##
==========================================
- Coverage 60.76% 60.76% -0.01%
==========================================
Files 279 281 +2
Lines 26321 26614 +293
==========================================
+ Hits 15993 16171 +178
- Misses 8651 8760 +109
- Partials 1677 1683 +6
|
|
@erikgrinaker would love your thoughts on the following. I've spent quite some time researching network and process queuing disciplines and their various tradeoffs. I have by no means come close to understanding all the disciplines and their tradeoffs, nor do I really want to spend too much time doing so. Given the nature of our now-legacy p2p stack (i.e. priorities), I've arrived at the following options for scheduling outbound messages to peers: |
|
|
I definitely agree that we should start out with something simple. Both of these approaches seem viable, but I think I'd prefer DRR over a priority heap -- the heap always picks a higher priority, which can cause high-priority reactors to completely starve low-priority ones, and I think it makes sense to let low-priority reactors at least send some messages during contention. I was going to suggest WFQ, but DWRR appears roughly equivalent and takes payload sizes into account as well, which is useful since we're scheduling at message granularity and a 10MB state sync chunk shouldn't count the same as a 1KB consensus vote.
The plan was originally to schedule at the byte level, but this would require the stream-based transport API, which has been deferred -- when/if that is implemented, we should rework the scheduling accordingly. Also note that because of the large message size variance, queue capacities should probably be in terms of bytes rather than number of messages (or possibly both). You may want to have a look at the legacy implementation in tendermint/p2p/conn/connection.go, lines 521 to 536 at 5a39f72. |
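To make the DWRR mechanics concrete, here is a minimal, self-contained sketch (not the PR's implementation; `msg`, `flow`, and `round` are illustrative names): each backlogged flow earns its quantum per round and may only dequeue messages its accumulated deficit can pay for, so a large low-weight message waits several rounds while small high-weight traffic flows immediately.

```go
package main

import "fmt"

// msg is a hypothetical message with a payload size in bytes.
type msg struct{ size int }

// flow is one per-channel queue with a DWRR quantum and a running deficit.
type flow struct {
	quantum int
	deficit int
	queue   []msg
}

// round performs one DWRR pass: each non-empty flow earns its quantum,
// then sends messages while its deficit covers the head-of-line size.
// It returns the indices of the flows that sent, in order.
func round(flows []*flow) (sent []int) {
	for i, f := range flows {
		if len(f.queue) == 0 {
			continue // idle flows earn no deficit
		}
		f.deficit += f.quantum
		for len(f.queue) > 0 && f.deficit >= f.queue[0].size {
			f.deficit -= f.queue[0].size
			f.queue = f.queue[1:]
			sent = append(sent, i)
		}
		if len(f.queue) == 0 {
			f.deficit = 0 // reset so idle flows can't bank credit
		}
	}
	return sent
}

func main() {
	// Flow 0: high-weight consensus-like traffic (small messages).
	// Flow 1: low-weight bulk traffic (one large message).
	flows := []*flow{
		{quantum: 1000, queue: []msg{{100}, {100}}},
		{quantum: 500, queue: []msg{{2000}}},
	}
	fmt.Println(round(flows)) // [0 0]: flow 0 drains at once
	fmt.Println(round(flows)) // []: flow 1's deficit (1000) < 2000
	fmt.Println(round(flows)) // []: deficit 1500, still short
	fmt.Println(round(flows)) // [1]: deficit 2000 finally covers it
}
```

Note how the 2000-byte message needs four rounds to accumulate enough credit, which is the byte-weighted fairness property that a pure priority heap lacks.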
|
Thanks for the input @erikgrinaker. I believe I have enough info to get started on this now. |
|
@erikgrinaker I was hoping you could spare a few minutes and take a look at the queue implementation, specifically with regard to weights, priorities, and the individual flow buffer sizes. I'm having difficulties mapping the old priorities and max message sizes to the new queue logic. In particular, I see idle block production slow down to ~10s no matter what values I pick. EDIT: I only see good block times, idle or under load, when I make the queue "fair", i.e. the same quantum and buffer capacity for all channels/flows. Either my implementation is wrong or picking quantums based on "priority" is going to be very difficult. Do we need it to be weighted? |
I'm having difficulties mapping old priorities and max message sizes to the new queue logic. In particular I see idle block production slow down to ~10s no matter what values I pick.
The simple E2E network (or any of them really) shouldn't produce anywhere near enough traffic to have any contention, so the scheduler should only have a negligible effect on latency. If the E2E test times vary to a statistically significant extent with the WDRR scheduler then the scheduler needs to be improved.
I only see good block times, idle or under load, when I make it "fair", i.e. the same quantum and buffer capacity for all channels/flows. Either my implementation is wrong or picking quantums based on "priority" is going to be very difficult.
Sounds like there's probably something wrong with the scheduler -- but also note (as mentioned elsewhere) that the scheduling policy shouldn't throttle traffic when there's no contention. I'm not really familiar with the details of DRR to say how it behaves, and don't have time to dive into it now.
Do we need it to be weighted?
I'd say so. But the objective here is mostly to make sure consensus traffic isn't impacted by other traffic (e.g. that malicious peers can't DoS or otherwise affect validator nodes, and that consensus traffic flows when the network interface is overloaded).
Within a single peer's queue I think it's probably still worth to have priorities, but scheduling between peers is probably more important (which in the current design is left to the transport layer or OS network stack) -- this really depends on traffic patterns and DoS vectors so it's hard to say without traffic simulations or benchmarks.
However, traffic the other way (from many peers to a reactor) should probably be weighted -- e.g. a validator probably wants to prioritize traffic from other validator nodes over just random other peers that may be trying to DoS the node.
p2p/wdrr_queue.go
Outdated
for _, chID := range q.chIDs {
    flow := q.flows[chID]
    if flow.getSize() > 0 {
        d := flow.incrDeficit()
Not very familiar with DRR, but does this mean that it will throttle traffic even when there is no contention? I.e. if there is a single message in a single flow (all others are empty) and the deficit < msgSize, we'll wait? That seems unfortunate, since the typical scenario is no contention, in which case we should neither drop nor delay messages to any significant extent.
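One way to address this concern (a hypothetical mitigation sketch, not necessarily what the PR does; `dequeueOne` and the types are illustrative) is to skip the deficit check when only a single flow is backlogged, since DRR's fairness guarantees only matter under contention:

```go
package main

import "fmt"

type msg struct{ size int }

type flow struct {
	quantum int
	deficit int
	queue   []msg
}

// active returns how many flows currently have queued messages.
func active(flows []*flow) int {
	n := 0
	for _, f := range flows {
		if len(f.queue) > 0 {
			n++
		}
	}
	return n
}

// dequeueOne pops the next message under DRR, but when only a single flow
// is backlogged (i.e. no contention) it bypasses the deficit check, so a
// lone message is never delayed waiting for credit to accumulate.
// (Round-robin rotation between flows is omitted for brevity.)
func dequeueOne(flows []*flow) (msg, bool) {
	uncontended := active(flows) == 1
	for _, f := range flows {
		if len(f.queue) == 0 {
			continue
		}
		f.deficit += f.quantum
		if uncontended || f.deficit >= f.queue[0].size {
			m := f.queue[0]
			f.queue = f.queue[1:]
			if f.deficit >= m.size {
				f.deficit -= m.size
			} else {
				f.deficit = 0 // bypassed: don't go negative
			}
			return m, true
		}
	}
	return msg{}, false
}

func main() {
	// A single 2000-byte message in a flow whose quantum is only 500:
	// plain DRR would need four rounds; the bypass sends it immediately.
	flows := []*flow{
		{quantum: 500, queue: []msg{{2000}}},
		{quantum: 1000}, // empty flow
	}
	m, ok := dequeueOne(flows)
	fmt.Println(m.size, ok) // 2000 true
}
```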
|
So with a few tweaks, I've been able to get good, consistent results, even with load, although the TPS tapers off drastically in terms of the number of txs included in blocks. Tweaks made: |
|
|
So I've committed the most up-to-date implementation of the queue. I want to leave this PR in draft as-is to get more collaboration on it. The current summary is as follows:
|
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Priority:            5,
RecvMessageCapacity: batchMsg.Size(),

MaxSendBytes: 5000,
Not really. But they will be removed entirely assuming we ditch the WDRR approach.
add a comment saying this? (nit)
Saying what exactly?
cmwaters
left a comment
LGTM 🎉 I just have a few comments to help my understanding
Do we need to add unit tests for the priority queue?
|
|
// MaxSendBytes defines the maximum number of bytes that can be sent at any
// given moment from a Channel to a peer.
MaxSendBytes uint
Does this behave quite similarly to the priority?
Does it just throttle how much data is sent?
I see that the blockchain reactor v0 is only at 100; is that okay for something that sends entire blocks over the wire?
The MaxSendBytes is used to determine the quanta for each flow/channel. To be honest, I'm not a huge fan of this and it's a bit complicated.
I'm leaning towards the simpler priority queue scheduler over WDRR. If we do go with the priority queue approach, MaxSendBytes will be removed entirely.
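For illustration, here is one plausible way `MaxSendBytes` could map onto per-channel DWRR quanta. The `ChannelDescriptor` shape below is simplified and the direct mapping is an assumption, not the PR's exact code; it shows why a value of 100 looks small next to 5000 for a channel shipping whole blocks:

```go
package main

import "fmt"

// ChannelDescriptor mirrors the fields discussed in the thread; the real
// struct in the codebase may differ.
type ChannelDescriptor struct {
	ID           byte
	Priority     int
	MaxSendBytes uint
}

// quanta derives a per-channel DWRR quantum directly from MaxSendBytes.
// Under this (assumed) mapping, a channel allowed to send more bytes per
// turn earns proportionally more credit each round, and Priority no
// longer feeds the scheduler at all.
func quanta(chs []ChannelDescriptor) map[byte]uint {
	q := make(map[byte]uint, len(chs))
	for _, ch := range chs {
		q[ch.ID] = ch.MaxSendBytes
	}
	return q
}

func main() {
	chs := []ChannelDescriptor{
		{ID: 0x20, Priority: 5, MaxSendBytes: 5000}, // consensus-like channel
		{ID: 0x40, Priority: 1, MaxSendBytes: 100},  // blockchain v0, as questioned above
	}
	q := quanta(chs)
	// The consensus channel earns 50x the credit per round, which is why
	// 100 bytes seems too small for a channel sending entire blocks.
	fmt.Println(q[0x20], q[0x40], q[0x20]/q[0x40]) // 5000 100 50
}
```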
tmpSize -= pqEnvTmp.size

// start from the end again
i = s.pq.Len() - 1
If the queue is ordered and you've just removed the one at the end (which has the lowest priority), then can't you just decrement i and continue removing the next one until you reach an adequate tmpSize?
The queue is not ordered. It's a max heap implemented via a priority queue. At any given node, the only thing you can assume is that all descendants are of lower priority, but the underlying array/slice is not ordered.
|
|
// dequeue

for s.pq.Len() > 0 {
Are we supposed to be popping a single envelope or all the envelopes in the queue?
Yeah so here we pop off/dequeue and send on the dequeue channel as much as we can. This obviously creates a point of contention between enqueuing and dequeueing.
WDYT @tychoish ?
Right, so we send as many envelopes as possible until the channel blocks. Don't we want to do this before we enqueue a new envelope? Perhaps the space that opens up when we dequeue allows the new envelope to join the queue where beforehand we would've dropped it.
Or better still, if pq.Len() is 0, see if we can push it onto the dequeue channel straight away.
We send as much as there is, yes. If it blocks, it blocks, otherwise it sends.
I'm not sure the order matters as much as the contention between enqueuing and dequeuing does. We didn't see any performance degradation on idle or small-load simple networks.
Okay maybe you can't keep sending stuff to a buffered channel until it blocks. What I'm reading at the moment says that the code will panic with a fatal error. EDIT: Nvm - The article I read was wrong
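For reference, a send to a full buffered channel blocks rather than panics; wrapping it in a `select` with a `default` case gives the "send until it would block" behavior discussed here. A sketch with a hypothetical `drain` helper (not the PR's code):

```go
package main

import "fmt"

// drain tries to push queued envelopes onto a buffered channel without
// ever blocking: when the channel is full, the send case isn't ready and
// the default case runs instead (there is no panic; an unguarded send
// would simply block).
func drain(queue []string, out chan string) (sent int) {
	for _, env := range queue {
		select {
		case out <- env:
			sent++
		default:
			return sent // channel full: stop and leave the rest queued
		}
	}
	return sent
}

func main() {
	out := make(chan string, 2) // small buffer to force the full case
	n := drain([]string{"a", "b", "c", "d"}, out)
	fmt.Println(n, len(out)) // 2 2: only as many as the buffer holds
}
```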
We have a lot of places where things can theoretically block (because channels) in some [admittedly easy to imagine] pathological cases.
All of these channels are buffered, which maybe gives a little bit of wiggle room, and maybe just makes the situation worse. I think the real point of contention is the bottleneck into/out of the heap (and its mutex, likely), which means, I think, we can factor this any way we want and the heap is still the bottleneck (e.g., having separate workers pulling incoming work and putting it on the heap).
It's potentially true that the buffer of the outgoing channel should always be 0 or 1 to prevent messages from getting stale in line, but we could definitely have threads consuming work from the queue waiting for work.
I can imagine ways to implement something with this functional component that would have less lock contention, or have "quicker" handling of messages, but I think the current implementation is definitely viable, and given that the interface/implementation is entirely private, it seems worth waiting to see before [over]-engineering.
Right, I'm all for simple first and [over]-engineering later, but if this is the case, can we make sure we document some of these thoughts so it's easier to pick up the conversation if we come back to it?

My initial findings are that the priority queue scheduler is both performant enough relative to the legacy stack and much simpler to understand than the WDRR approach. I recommend keeping both for now, with a preference for the priority queue. Ultimately, we should land on one and remove the others along with the env var.
ref: #5670
/cc @tychoish