p2p: revised router message scheduling#6126

Merged
alexanderbez merged 27 commits into master from bez/p2p-revise-queue-scheduler
Mar 25, 2021

Conversation

@alexanderbez
Contributor

@alexanderbez alexanderbez commented Feb 17, 2021

  • Introduce two message scheduling algorithms:
    • WDRR
    • Priority Queue
  • Add an env var to switch between scheduling algos (thanks @tychoish)
  • Add queue/scheduler factory (thanks @tychoish)
  • Add some new p2p metrics that are specific to the new scheduler logic
  • Some minor tweaks to e2e tests

My initial findings are that the priority queue scheduler is performant enough relative to the legacy stack and is much simpler to understand than the WDRR approach. I recommend keeping both for now, with a preference for the priority queue. Ultimately, we should land on one and remove the other along with the env var.

ref: #5670

/cc @tychoish

@alexanderbez
Contributor Author

[profiling screenshot]

A basic benchmark against the e2e simple network with load: 50-75 txs/block before getting degraded. Eventually, the network is only able to handle 15-20 txs/block...not sure why, but I doubt it's p2p-related. From this, we see that reading off of peer queues before sending on the TCP connection takes the most time.

@codecov

codecov bot commented Feb 19, 2021

Codecov Report

Merging #6126 (fc97795) into master (2ceb816) will decrease coverage by 0.00%.
The diff coverage is 72.47%.

@@            Coverage Diff             @@
##           master    #6126      +/-   ##
==========================================
- Coverage   60.76%   60.76%   -0.01%     
==========================================
  Files         279      281       +2     
  Lines       26321    26614     +293     
==========================================
+ Hits        15993    16171     +178     
- Misses       8651     8760     +109     
- Partials     1677     1683       +6     
Impacted Files Coverage Δ
blockchain/v0/reactor.go 63.20% <ø> (-1.60%) ⬇️
consensus/reactor.go 67.86% <ø> (-2.04%) ⬇️
evidence/reactor.go 72.80% <ø> (ø)
mempool/reactor.go 74.46% <0.00%> (-1.97%) ⬇️
p2p/conn/connection.go 78.51% <ø> (-0.56%) ⬇️
statesync/reactor.go 57.01% <ø> (ø)
p2p/pqueue.go 26.60% <26.60%> (ø)
node/node.go 59.01% <86.36%> (+0.46%) ⬆️
p2p/router.go 80.21% <90.00%> (+1.92%) ⬆️
p2p/metrics.go 98.64% <100.00%> (+1.21%) ⬆️
... and 18 more

@alexanderbez
Contributor Author

alexanderbez commented Feb 22, 2021

@erikgrinaker would love your thoughts on the following.

I've spent quite some time researching network and process queuing disciplines and their various tradeoffs. I have by no means come close to understanding all the disciplines and their tradeoffs, nor do I really want to spend too much time doing so.

I've arrived at different options that we can take from here, given the nature of our now-legacy p2p stack (i.e. priorities), for outbound messages to peers:

  1. Each peer's queue has a simple priority heap as the queue, where the capacity is fixed and envelope priority in the heap is determined by the corresponding channel priority. When the queue is full, messages are dropped. Or,
  2. Use (weighted?) deficit round robin [DRR]. This is an adaptation of GPS and is relatively simple to understand and implement. Each peer would have a queue per flow (i.e. channel) and we would either pick a fixed quantum per flow or perhaps a quantum per flow based on that channel's priority. When any given queue is full, messages are dropped.
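For reference, option 2 can be sketched roughly as follows. This is an illustrative toy (hypothetical `flow`/`envelope` types, not the PR's implementation): each backlogged flow accrues a quantum of byte credit per round, and may dequeue messages as long as its deficit covers their size.

```go
package main

import "fmt"

// envelope is a hypothetical stand-in for a p2p message with a byte size.
type envelope struct {
	size int
	body string
}

// flow is a per-channel FIFO with a DRR deficit counter.
type flow struct {
	quantum int // bytes of credit added each round
	deficit int
	queue   []envelope
}

// drrRound performs one deficit-round-robin pass over the flows,
// returning the envelopes scheduled for sending this round.
func drrRound(flows []*flow) []envelope {
	var out []envelope
	for _, f := range flows {
		if len(f.queue) == 0 {
			continue // empty flows accrue no credit
		}
		f.deficit += f.quantum
		// Dequeue while the head message fits within the deficit.
		for len(f.queue) > 0 && f.queue[0].size <= f.deficit {
			head := f.queue[0]
			f.queue = f.queue[1:]
			f.deficit -= head.size
			out = append(out, head)
		}
		if len(f.queue) == 0 {
			f.deficit = 0 // reset so idle flows can't hoard credit
		}
	}
	return out
}

func main() {
	consensus := &flow{quantum: 100, queue: []envelope{{size: 80, body: "vote"}}}
	statesync := &flow{quantum: 50, queue: []envelope{{size: 120, body: "chunk"}}}
	sent := drrRound([]*flow{consensus, statesync})
	fmt.Println(len(sent)) // the vote fits in round one; the larger chunk must wait
}
```

Making it *weighted* DRR is then just a matter of deriving each flow's quantum from its channel priority rather than using a fixed value.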

@erikgrinaker
Contributor

erikgrinaker commented Feb 22, 2021

I definitely agree that we should start out with something simple. Both of these approaches seem viable, but I think I'd prefer DRR over a priority heap -- the heap always picks the highest priority first, which can cause high-priority reactors to completely starve low-priority ones, and I think it makes sense to let low-priority reactors send at least some messages during contention.

I was going to suggest WFQ, but DWRR appears roughly equivalent and takes into account payload sizes as well, which is useful since we're scheduling at message granularity and a 10MB state sync chunk shouldn't count the same as a 1KB consensus vote. The plan was originally to schedule at the byte level, but this would require the stream-based transport API which has been deferred -- when/if this is implemented, we should rework the scheduling for this. Also note that because of the large message size variance, queue capacities should probably be in terms of bytes rather than number of messages (or possibly both), using proto.Size() to calculate message sizes.

You may want to have a look at the MConnection queue scheduler as well, which uses priority combined with the time since last packet send:

// Choose a channel to create a PacketMsg from.
// The chosen channel will be the one whose recentlySent/priority is the least.
var leastRatio float32 = math.MaxFloat32
var leastChannel *Channel
for _, channel := range c.channels {
	// If nothing to send, skip this channel
	if !channel.isSendPending() {
		continue
	}
	// Get ratio, and keep track of lowest ratio.
	ratio := float32(channel.recentlySent) / float32(channel.desc.Priority)
	if ratio < leastRatio {
		leastRatio = ratio
		leastChannel = channel
	}
}

@alexanderbez
Contributor Author

Thanks for the input @erikgrinaker. I believe I have enough info to get started on this now.

@alexanderbez
Contributor Author

alexanderbez commented Feb 26, 2021

@erikgrinaker I was hoping you could spare a few minutes and take a look at the queue implementation. Specifically in regards to weights, priorities and the individual flow buffer sizes.

I'm having difficulties mapping old priorities and max message sizes to the new queue logic. In particular I see idle block production slow down to ~10s no matter what values I pick.


EDIT: I only see good block times (idle or under load) when I make it "fair", i.e. the same quantum and buffer capacity for all channels/flows. Either my implementation is wrong or picking quanta based on "priority" is going to be very difficult.

Do we need it to be weighted?

Contributor

@erikgrinaker erikgrinaker left a comment

I'm having difficulties mapping old priorities and max message sizes to the new queue logic. In particular I see idle block production slow down to ~10s no matter what values I pick.

The simple E2E network (or any of them really) shouldn't produce anywhere near enough traffic to have any contention, so the scheduler should only have a negligible effect on latency. If the E2E test times vary to a statistically significant extent with the WDRR scheduler then the scheduler needs to be improved.

I only see good block times (idle or under load) when I make it "fair", i.e. the same quantum and buffer capacity for all channels/flows. Either my implementation is wrong or picking quanta based on "priority" is going to be very difficult.

Sounds like there's probably something wrong with the scheduler -- but also note (as mentioned elsewhere) that the scheduling policy shouldn't throttle traffic when there's no contention. I'm not really familiar with the details of DRR to say how it behaves, and don't have time to dive into it now.

Do we need it to be weighted?

I'd say so. But the objective here is mostly to make sure consensus traffic isn't impacted by other traffic (e.g. that malicious peers can't DoS or otherwise affect validator nodes, and that consensus traffic flows when the network interface is overloaded).

Within a single peer's queue I think it's probably still worth to have priorities, but scheduling between peers is probably more important (which in the current design is left to the transport layer or OS network stack) -- this really depends on traffic patterns and DoS vectors so it's hard to say without traffic simulations or benchmarks.

However, traffic the other way (from many peers to a reactor) should probably be weighted -- e.g. a validator probably wants to prioritize traffic from other validator nodes over just random other peers that may be trying to DoS the node.

for _, chID := range q.chIDs {
	flow := q.flows[chID]
	if flow.getSize() > 0 {
		d := flow.incrDeficit()
Contributor

Not very familiar with DRR, but does this mean that it will throttle traffic even when there is no contention? I.e. if there is a single message in a single flow (all others are empty), if the deficit < msgSize we'll wait? That seems unfortunate, since the typical scenario is no contention, in which case we should neither drop nor delay messages to any significant extent.

@alexanderbez
Contributor Author

alexanderbez commented Mar 1, 2021

So with a few tweaks, I've been able to get good, consistent results, even with load, although the TPS tapers off drastically in terms of the # of txs included in blocks:

Tweaks made:

  • Single synchronization goroutine (i.e. no mutex).
    • This requires large buffered enqueue and dequeue channels (e.g. 1000) so enqueueing doesn't hold up the dequeue loop and vice versa.
  • Single shared buffer that all flows share with a fixed capacity (used 1MB in tests -- should probably be larger?).
  • Quanta are based on average message sizes multiplied by priority.
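The first tweak above (a single synchronization goroutine instead of a mutex, fed by large buffered enqueue and dequeue channels) could be sketched roughly as below. Types and names are hypothetical; the internal `queue` slice stands in for the heap/flows:

```go
package main

import "fmt"

// scheduler serializes all queue access through one goroutine, so no
// mutex is needed; buffered channels keep enqueuers from stalling the
// dequeue side and vice versa (illustrative sketch, not the PR's code).
type scheduler struct {
	enqueueCh chan string
	dequeueCh chan string
	closeCh   chan struct{}
}

func newScheduler(buf int) *scheduler {
	s := &scheduler{
		enqueueCh: make(chan string, buf),
		dequeueCh: make(chan string, buf),
		closeCh:   make(chan struct{}),
	}
	go s.run()
	return s
}

// run is the single goroutine that owns the internal queue state.
func (s *scheduler) run() {
	var queue []string // stand-in for the priority heap / WDRR flows
	for {
		// While messages are pending, stay ready to both drain and accept.
		for len(queue) > 0 {
			select {
			case s.dequeueCh <- queue[0]:
				queue = queue[1:]
			case msg := <-s.enqueueCh:
				queue = append(queue, msg)
			case <-s.closeCh:
				return
			}
		}
		// Queue empty: block until new work or shutdown.
		select {
		case msg := <-s.enqueueCh:
			queue = append(queue, msg)
		case <-s.closeCh:
			return
		}
	}
}

func main() {
	s := newScheduler(1000)
	s.enqueueCh <- "vote"
	fmt.Println(<-s.dequeueCh)
	close(s.closeCh)
}
```

With large buffers (e.g. 1000), a full dequeue channel rarely blocks enqueuing and vice versa, which is the contention tradeoff discussed below.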

@alexanderbez
Contributor Author

So I've committed the most up-to-date implementation of the queue. I want to leave this PR in draft as-is to get more collaboration on it. The current summary is as follows:

  • A single goroutine scheduler is used in the WDRR scheduler/queue. See the XXX/TODO here. It's not clear what kind of contention this causes or if it's the ideal approach. Alternatively, we could explore separate goroutines with a mutex.
  • With the current values and implementation, a bare-bones simple e2e network runs fine, but I notice that for the first few blocks (e.g. 5-10), one of the four nodes lags. After block 10 or so, it runs pretty normally with an avg block time of ~2.5s and all nodes in unison.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale for use by stalebot label Mar 14, 2021
@alexanderbez alexanderbez removed the stale for use by stalebot label Mar 14, 2021
@alexanderbez alexanderbez changed the title p2p: improve router queue message scheduling p2p: revised router message scheduling Mar 23, 2021
@alexanderbez alexanderbez marked this pull request as ready for review March 23, 2021 14:47
@alexanderbez alexanderbez requested a review from tychoish March 23, 2021 14:47
@alexanderbez alexanderbez added the C:p2p Component: P2P pkg label Mar 23, 2021
@alexanderbez alexanderbez requested a review from tac0turtle March 23, 2021 14:48
Priority: 5,
RecvMessageCapacity: batchMsg.Size(),

MaxSendBytes: 5000,
Contributor

are these arbitrary?

Contributor Author

Not really. But they will be removed entirely assuming we ditch the WDRR approach.

Contributor

add a comment saying this? (nit)

Contributor Author

Saying what exactly?

Contributor

@cmwaters cmwaters left a comment

LGTM 🎉 I just have a few comments to help my understanding

Do we need to add unit tests for the priority queue?


// MaxSendBytes defines the maximum number of bytes that can be sent at any
// given moment from a Channel to a peer.
MaxSendBytes uint
Contributor

Does this behave quite similarly to the priority?

Does it just throttle how much data is sent?

I see that the blockchain reactor v0 is only at 100. Is that okay for something that sends entire blocks over the wire?

Contributor Author

The MaxSendBytes is used to determine the quanta for each flow/channel. To be honest, I'm not a huge fan of this and it's a bit complicated.

I'm leaning towards the simpler priority queue scheduler over WDRR. If we do go with the priority queue approach, MaxSendBytes will be removed entirely.

tmpSize -= pqEnvTmp.size

// start from the end again
i = s.pq.Len() - 1
Contributor

If the queue is ordered and you've just removed the one at the end (which has the lowest priority), then can't you just decrement i and continue removing the next one until you reach an adequate tmpSize?

Contributor Author

The queue is not ordered. It's a priority queue implemented as a max heap. At any given node, the only thing you can assume is that all descendants are of lower priority; the underlying array/slice is not sorted.


// dequeue

for s.pq.Len() > 0 {
Contributor

Are we supposed to be popping a single envelope or all the envelopes in the queue?

Contributor Author

Yeah so here we pop off/dequeue and send on the dequeue channel as much as we can. This obviously creates a point of contention between enqueuing and dequeueing.

WDYT @tychoish ?

Contributor

Right, so we send as many envelopes as possible until the channel blocks. Don't we want to do this before we enqueue a new envelope? Like perhaps the space that opens up when we dequeue allows the new envelope to join the queue, when beforehand we would've dropped it.

Contributor

Or better still, if pq.Len() is 0, see if we can put it on the dequeue channel straight away.

Contributor Author

We send as much as there is, yes. If it blocks, it blocks, otherwise it sends.

I'm not sure the order matters as much as the contention between enqueuing and dequeuing does. We didn't see any performance degradation on idle or small-load simple networks.

Contributor

@cmwaters cmwaters Mar 25, 2021

Okay maybe you can't keep sending stuff to a buffered channel until it blocks. What I'm reading at the moment says that the code will panic with a fatal error. EDIT: Nvm - The article I read was wrong
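For the record: sending on a *full* buffered channel blocks the sender until a receiver frees a slot; it never panics (panics only arise from sending on a closed channel). A minimal demonstration, using a `select` with `default` as the standard non-blocking send idiom:

```go
package main

import (
	"fmt"
	"time"
)

// trySend attempts a non-blocking send and reports whether it succeeded.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false // buffer full: a plain send would block, not panic
	}
}

func main() {
	ch := make(chan int, 2)
	fmt.Println(trySend(ch, 1)) // true: buffer has room
	fmt.Println(trySend(ch, 2)) // true: buffer now full
	fmt.Println(trySend(ch, 3)) // false: would block

	// A blocking send on the full channel simply waits for a receiver.
	go func() {
		time.Sleep(10 * time.Millisecond)
		<-ch // free a slot
	}()
	ch <- 3 // blocks briefly, then succeeds; no panic
	fmt.Println("sent after receiver freed a slot")
}
```

This is the behavior the dequeue loop above relies on: "if it blocks, it blocks" is safe, just potentially slow.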

Contributor

We have a lot of places where things can theoretically block (because channels) in some [admittedly easy to imagine] pathological cases.

All of these channels are buffered, which maybe gives a little bit of wiggle room, and maybe just makes the situation worse. I think the real point of contention is the bottleneck into/out of the heap (and its mutex, likely), which means we can factor this any way we want and the heap is still the bottleneck (e.g., having separate workers pulling incoming work and putting it on the heap).

It's potentially true that the buffer of the outgoing channel should always be 0 or 1 to prevent messages from getting stale in line, but we could definitely have threads consuming work from the queue waiting for work.

I can imagine ways to implement something with this functional component that would have less lock contention, or "quicker" handling of messages, but I think the current implementation is definitely viable, and given that the interface/implementations are entirely private, it seems worth waiting to see before [over-]engineering.

Contributor

Right, I'm all for simple first and [over]-engineering later but if this is the case can we make sure we document some of these thoughts so it's easier to pick up the conversation if we are to come back to it.

@alexanderbez alexanderbez merged commit a554005 into master Mar 25, 2021
@alexanderbez alexanderbez deleted the bez/p2p-revise-queue-scheduler branch March 25, 2021 20:58
