p2p: revised router message scheduling#6126
Conversation
|
A basic benchmark against the e2e simple network shows it handling a load of 50-75 txs/block before degrading. Eventually, the network is only able to handle 15-20 txs/block. I'm not sure why, but I doubt it's p2p related. From this, we see that reading off of peer queues before sending on the TCP connection takes the most time. |
Codecov Report
@@ Coverage Diff @@
## master #6126 +/- ##
==========================================
- Coverage 60.76% 60.76% -0.01%
==========================================
Files 279 281 +2
Lines 26321 26614 +293
==========================================
+ Hits 15993 16171 +178
- Misses 8651 8760 +109
- Partials 1677 1683 +6
|
|
@erikgrinaker would love your thoughts on the following. I've spent quite some time researching network and process queuing disciplines and their various tradeoffs. I have by no means come close to understanding all the disciplines and their tradeoffs, nor do I really want to spend too much time doing so. Given the nature of our now-legacy p2p stack (i.e. priorities), I've arrived at the following options for scheduling outbound messages to peers: |
|
|
I definitely agree that we should start out with something simple. Both of these approaches seem viable, but I think I'd prefer DRR over a priority heap -- the heap always picks a higher priority, which can cause high-priority reactors to completely starve low-priority ones, and I think it makes sense to let low-priority reactors at least send some messages during contention. I was going to suggest WFQ, but DWRR appears roughly equivalent and takes payload sizes into account as well, which is useful since we're scheduling at message granularity and a 10MB state sync chunk shouldn't count the same as a 1KB consensus vote.
The plan was originally to schedule at the byte level, but this would require the stream-based transport API, which has been deferred -- when/if that is implemented, we should rework the scheduling accordingly. Also note that because of the large message size variance, queue capacities should probably be in terms of bytes rather than number of messages (or possibly both). You may want to have a look at the legacy implementation in tendermint/p2p/conn/connection.go, lines 521 to 536 at 5a39f72. |
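To make the DWRR mechanics concrete, here is a minimal, self-contained sketch (not the PR's implementation; `msg`, `flow`, and `round` are illustrative names): each backlogged flow earns its quantum per round and may only dequeue messages its accumulated deficit can pay for, so a large low-weight message waits several rounds while small high-weight traffic flows immediately.

```go
package main

import "fmt"

// msg is a hypothetical message with a payload size in bytes.
type msg struct{ size int }

// flow is one per-channel queue with a DWRR quantum and a running deficit.
type flow struct {
	quantum int
	deficit int
	queue   []msg
}

// round performs one DWRR pass: each non-empty flow earns its quantum,
// then sends messages while its deficit covers the head-of-line size.
// It returns the indices of the flows that sent, in order.
func round(flows []*flow) (sent []int) {
	for i, f := range flows {
		if len(f.queue) == 0 {
			continue // idle flows earn no deficit
		}
		f.deficit += f.quantum
		for len(f.queue) > 0 && f.deficit >= f.queue[0].size {
			f.deficit -= f.queue[0].size
			f.queue = f.queue[1:]
			sent = append(sent, i)
		}
		if len(f.queue) == 0 {
			f.deficit = 0 // reset so idle flows can't bank credit
		}
	}
	return sent
}

func main() {
	// Flow 0: high-weight consensus-like traffic (small messages).
	// Flow 1: low-weight bulk traffic (one large message).
	flows := []*flow{
		{quantum: 1000, queue: []msg{{100}, {100}}},
		{quantum: 500, queue: []msg{{2000}}},
	}
	fmt.Println(round(flows)) // [0 0]: flow 0 drains at once
	fmt.Println(round(flows)) // []: flow 1's deficit (1000) < 2000
	fmt.Println(round(flows)) // []: deficit 1500, still short
	fmt.Println(round(flows)) // [1]: deficit 2000 finally covers it
}
```

Note how the 2000-byte message needs four rounds to accumulate enough credit, which is the byte-weighted fairness property that a pure priority heap lacks.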
|
Thanks for the input @erikgrinaker. I believe I have enough info to get started on this now. |
|
@erikgrinaker I was hoping you could spare a few minutes and take a look at the queue implementation, specifically with regard to weights, priorities, and the individual flow buffer sizes. I'm having difficulties mapping the old priorities and max message sizes to the new queue logic. In particular, I see idle block production slow down to ~10s no matter what values I pick. EDIT: I only see good block times, idle or under load, when I make the queue "fair", i.e. the same quantum and buffer capacity for all channels/flows. Either my implementation is wrong or picking quantums based on "priority" is going to be very difficult. Do we need it to be weighted? |
I'm having difficulties mapping old priorities and max message sizes to the new queue logic. In particular I see idle block production slow down to ~10s no matter what values I pick.
The simple E2E network (or any of them really) shouldn't produce anywhere near enough traffic to have any contention, so the scheduler should only have a negligible effect on latency. If the E2E test times vary to a statistically significant extent with the WDRR scheduler then the scheduler needs to be improved.
I only see good block times, idle or under load, when I make it "fair", i.e. the same quantum and buffer capacity for all channels/flows. Either my implementation is wrong or picking quantums based on "priority" is going to be very difficult.
Sounds like there's probably something wrong with the scheduler -- but also note (as mentioned elsewhere) that the scheduling policy shouldn't throttle traffic when there's no contention. I'm not really familiar with the details of DRR to say how it behaves, and don't have time to dive into it now.
Do we need it to be weighted?
I'd say so. But the objective here is mostly to make sure consensus traffic isn't impacted by other traffic (e.g. that malicious peers can't DoS or otherwise affect validator nodes, and that consensus traffic flows when the network interface is overloaded).
Within a single peer's queue I think it's probably still worth to have priorities, but scheduling between peers is probably more important (which in the current design is left to the transport layer or OS network stack) -- this really depends on traffic patterns and DoS vectors so it's hard to say without traffic simulations or benchmarks.
However, traffic the other way (from many peers to a reactor) should probably be weighted -- e.g. a validator probably wants to prioritize traffic from other validator nodes over just random other peers that may be trying to DoS the node.
p2p/wdrr_queue.go
Outdated
for _, chID := range q.chIDs {
    flow := q.flows[chID]
    if flow.getSize() > 0 {
        d := flow.incrDeficit()
Not very familiar with DRR, but does this mean that it will throttle traffic even when there is no contention? I.e. if there is a single message in a single flow (all others are empty) and the deficit < msgSize, we'll wait? That seems unfortunate, since the typical scenario is no contention, in which case we should neither drop nor delay messages to any significant extent.
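One way to address this concern (a hypothetical mitigation sketch, not necessarily what the PR does; `dequeueOne` and the types are illustrative) is to skip the deficit check when only a single flow is backlogged, since DRR's fairness guarantees only matter under contention:

```go
package main

import "fmt"

type msg struct{ size int }

type flow struct {
	quantum int
	deficit int
	queue   []msg
}

// active returns how many flows currently have queued messages.
func active(flows []*flow) int {
	n := 0
	for _, f := range flows {
		if len(f.queue) > 0 {
			n++
		}
	}
	return n
}

// dequeueOne pops the next message under DRR, but when only a single flow
// is backlogged (i.e. no contention) it bypasses the deficit check, so a
// lone message is never delayed waiting for credit to accumulate.
// (Round-robin rotation between flows is omitted for brevity.)
func dequeueOne(flows []*flow) (msg, bool) {
	uncontended := active(flows) == 1
	for _, f := range flows {
		if len(f.queue) == 0 {
			continue
		}
		f.deficit += f.quantum
		if uncontended || f.deficit >= f.queue[0].size {
			m := f.queue[0]
			f.queue = f.queue[1:]
			if f.deficit >= m.size {
				f.deficit -= m.size
			} else {
				f.deficit = 0 // bypassed: don't go negative
			}
			return m, true
		}
	}
	return msg{}, false
}

func main() {
	// A single 2000-byte message in a flow whose quantum is only 500:
	// plain DRR would need four rounds; the bypass sends it immediately.
	flows := []*flow{
		{quantum: 500, queue: []msg{{2000}}},
		{quantum: 1000}, // empty flow
	}
	m, ok := dequeueOne(flows)
	fmt.Println(m.size, ok) // 2000 true
}
```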
|
So with a few tweaks, I've been able to get good, consistent results, even with load, although the TPS tapers off drastically in terms of the number of txs included in blocks. Tweaks made: |
|
|
So I've committed the most up-to-date implementation of the queue. I want to leave this PR in draft as-is to get more collaboration on it. The current summary is as follows:
|
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Priority:            5,
RecvMessageCapacity: batchMsg.Size(),

MaxSendBytes: 5000,
Not really. But they will be removed entirely assuming we ditch the WDRR approach.
add a comment saying this? (nit)
Saying what exactly?
cmwaters
left a comment
LGTM 🎉 I just have a few comments to help my understanding
Do we need to add unit tests for the priority queue?
|
|
// MaxSendBytes defines the maximum number of bytes that can be sent at any
// given moment from a Channel to a peer.
MaxSendBytes uint
Does this behave quite similarly to the priority?
Does it just throttle how much data is sent?
I see that the blockchain reactor v0 is only at 100; is that okay for something that sends entire blocks over the wire?
The MaxSendBytes is used to determine the quanta for each flow/channel. To be honest, I'm not a huge fan of this and it's a bit complicated.
I'm leaning towards the simpler priority queue scheduler over WDRR. If we do go with the priority queue approach, MaxSendBytes will be removed entirely.
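For illustration, here is one plausible way `MaxSendBytes` could map onto per-channel DWRR quanta. The `ChannelDescriptor` shape below is simplified and the direct mapping is an assumption, not the PR's exact code; it shows why a value of 100 looks small next to 5000 for a channel shipping whole blocks:

```go
package main

import "fmt"

// ChannelDescriptor mirrors the fields discussed in the thread; the real
// struct in the codebase may differ.
type ChannelDescriptor struct {
	ID           byte
	Priority     int
	MaxSendBytes uint
}

// quanta derives a per-channel DWRR quantum directly from MaxSendBytes.
// Under this (assumed) mapping, a channel allowed to send more bytes per
// turn earns proportionally more credit each round, and Priority no
// longer feeds the scheduler at all.
func quanta(chs []ChannelDescriptor) map[byte]uint {
	q := make(map[byte]uint, len(chs))
	for _, ch := range chs {
		q[ch.ID] = ch.MaxSendBytes
	}
	return q
}

func main() {
	chs := []ChannelDescriptor{
		{ID: 0x20, Priority: 5, MaxSendBytes: 5000}, // consensus-like channel
		{ID: 0x40, Priority: 1, MaxSendBytes: 100},  // blockchain v0, as questioned above
	}
	q := quanta(chs)
	// The consensus channel earns 50x the credit per round, which is why
	// 100 bytes seems too small for a channel sending entire blocks.
	fmt.Println(q[0x20], q[0x40], q[0x20]/q[0x40]) // 5000 100 50
}
```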
tmpSize -= pqEnvTmp.size

// start from the end again
i = s.pq.Len() - 1
If the queue is ordered and you've just removed the one at the end (which has the lowest priority), then can't you just decrement i and continue removing the next one until you reach an adequate tmpSize?
The queue is not ordered. It's a max heap implemented via a priority queue. At any given node, the only thing you can assume is that all descendants are of lower priority, but the underlying array/slice is not ordered.
|
|
// dequeue

for s.pq.Len() > 0 {
Are we supposed to be popping a single envelope or all the envelopes in the queue?
Yeah so here we pop off/dequeue and send on the dequeue channel as much as we can. This obviously creates a point of contention between enqueuing and dequeueing.
WDYT @tychoish ?
Right, so we send as many envelopes as possible until the channel blocks. Don't we want to do this before we enqueue a new envelope? Perhaps the space that opens up when we dequeue allows the new envelope to join the queue where beforehand we would've dropped it.
Or better still, if pq.Len() is 0, see if we can push it onto the dequeue channel straight away.
We send as much as there is, yes. If it blocks, it blocks, otherwise it sends.
I'm not sure the order matters as much as the contention between enqueuing and dequeuing does. We didn't see any performance degradation on idle or small-load simple networks.
Okay maybe you can't keep sending stuff to a buffered channel until it blocks. What I'm reading at the moment says that the code will panic with a fatal error. EDIT: Nvm - The article I read was wrong
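For reference, a send to a full buffered channel blocks rather than panics; wrapping it in a `select` with a `default` case gives the "send until it would block" behavior discussed here. A sketch with a hypothetical `drain` helper (not the PR's code):

```go
package main

import "fmt"

// drain tries to push queued envelopes onto a buffered channel without
// ever blocking: when the channel is full, the send case isn't ready and
// the default case runs instead (there is no panic; an unguarded send
// would simply block).
func drain(queue []string, out chan string) (sent int) {
	for _, env := range queue {
		select {
		case out <- env:
			sent++
		default:
			return sent // channel full: stop and leave the rest queued
		}
	}
	return sent
}

func main() {
	out := make(chan string, 2) // small buffer to force the full case
	n := drain([]string{"a", "b", "c", "d"}, out)
	fmt.Println(n, len(out)) // 2 2: only as many as the buffer holds
}
```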
We have a lot of places where things can theoretically block (because channels) in some [admittedly easy to imagine] pathological cases.
All of these channels are buffered, which maybe gives a little bit of wiggle room, and maybe just makes the situation worse. I think the real point of contention is the bottleneck into/out of the heap (and its mutex, likely), which means, I think, we can factor this any way we want and the heap is still the bottleneck (e.g., having separate workers pulling incoming work and putting it on the heap).
It's potentially true that the buffer of the outgoing channel should always be 0 or 1 to prevent messages from getting stale in line, but we could definitely have threads consuming work from the queue waiting for work.
I can imagine ways to implement something with this functional component that would have less lock contention, or have "quicker" handling of messages, but I think the current implementation is definitely viable, and given that the interface/implementation is entirely private, it seems worth waiting to see before [over]-engineering.
Right, I'm all for simple first and [over]-engineering later, but if this is the case, can we make sure we document some of these thoughts so it's easier to pick up the conversation if we come back to it?

My initial findings are that the priority queue scheduler is both performant enough relative to the legacy stack and much simpler to understand than the WDRR approach. I recommend keeping both for now, with a preference for the priority queue. Ultimately, we should land on one and remove the others along with the env var.
ref: #5670
/cc @tychoish