src/overlay/FlowControlCapacity.cpp
will need to update the Visual Studio project file before merging
src/overlay/Peer.cpp
Why do we need mFlowControlMessages->getOutboundCapacity() > 0 ?
We don't, message outbound capacity is guaranteed to always be positive (I initially made this change when I made outbound capacity int64 instead of uint64, but forgot to remove it). Removed it.
src/overlay/FlowControlCapacity.cpp
We should not allow going beyond capacity, i.e. the code should be:

```cpp
auto capacity = getMsgResourceCount(msg);
releaseAssert(mOutboundCapacity > capacity);
mOutboundCapacity -= capacity;
```
or follow tryLock semantics like in lockLocalCapacity
src/overlay/FlowControlCapacity.cpp
this should be cleaned up as part of this refactor (I am not really sure why it's done like this in the first place):
- lockLocalCapacity should be called tryLockLocalCapacity; we should not update capacity if we can't lock
- capacity should never go negative (this points to not having a clean definition of what we're trying to enforce)
Addressing both this and the other comment around tryLock: I don't believe the proposed approach is correct. I intentionally implemented it this way, because right now messages take up arbitrary capacity that the receiver does not know ahead of time.
Consider a simple scenario: the receiver sets its batch size to 100 bytes, and asks the sender for 100 bytes worth of traffic. The sender has the next SCP message to send of size 300 bytes. Unless we allow capacity to temporarily go negative, the system deadlocks. To unblock it, we allow going past capacity for up to (message_size - 1).
Note that we wouldn't need to do this if we implemented flow control based on uniformly sized batches (no need to go negative if we know the max batch size ahead of time), but I'm not doing it in this iteration because (a) the benefits of this approach are unclear, and (b) it seems like it'll add more complexity to the code, since we'll need to deconstruct/reconstruct messages, reconcile when to request more data based on complete messages vs. batches received, etc.
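To make the trade-off concrete, the "capacity may go temporarily negative" accounting described above can be sketched as a tiny model. This is illustrative only (hypothetical names, not the actual stellar-core implementation): a message goes out whenever any capacity remains, even if the message is larger than what is left, so a single oversized message never deadlocks the connection.

```cpp
#include <cassert>
#include <cstdint>

// Simplified sender-side model of the "temporarily negative" scheme.
struct OutboundCapacity
{
    int64_t mCapacity = 0; // signed on purpose: may dip below zero

    // Sending is allowed as long as *some* capacity remains, even if the
    // next message is bigger than what is left.
    bool canSend() const { return mCapacity > 0; }

    void send(int64_t msgSizeBytes)
    {
        assert(canSend());
        mCapacity -= msgSizeBytes;
        // We overshoot by at most (msgSizeBytes - 1), since capacity was
        // at least 1 before the subtraction.
        assert(mCapacity >= -(msgSizeBytes - 1));
    }

    // A SEND_MORE message from the peer replenishes capacity.
    void release(int64_t numBytes) { mCapacity += numBytes; }
};
```

In the 100-byte/300-byte example from this thread, the sender still ships the 300-byte SCP message once (capacity goes to -200) and then blocks until the receiver releases enough capacity.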
I am not sure I understand the deadlock comment, that sounds more like a limitation of the design that I imagine we can overcome? In your example, we could just wait to receive more "send more" messages until there is more than 300?
Accounting in bytes or in terms of messages (where we already send multiple messages until we use up capacity) should behave the same way: if a peer sends us too much data, that means it's not respecting flow control, right?
The reason I flagged this in the first place is that if you have 100 bytes of capacity left, we allow receiving a 100KB message (or whatever the max message size is), but we would block after receiving a small (200-byte) transaction, which means we'd allow ~2x the configured quota for large messages.
I am not sure I understand the deadlock comment, that sounds more like a limitation of the design that I imagine we can overcome? In your example, we could just wait to receive more "send more" messages until there is more than 300?
Send more is triggered when capacity is released. This means that the node needs to receive 100 bytes worth of data before requesting more. If it requests more data arbitrarily instead, that defeats the purpose of flow control. I'm not sure I understand what you're proposing, but I think it's the same thing as the current implementation (except we shouldn't send arbitrary "send more" messages): in the end, the node receives 300 bytes when it only has 100 bytes of capacity to spare. This limitation comes from the fact that we transmit full messages, and we don't know their sizes ahead of time.
I looked into this further after our chat, and even sketched the change, but still found several issues:
- We can't apply "capacity is always positive" reasoning to reading capacity as we don't know what non-flood traffic a peer is sending us and it's not controlled with SEND_MORE messages (we only define max message size for flood traffic, like transactions).
- I still find it hard to reason about the proposal and the relationship between capacity and batch size - specifically, I can't tell if this logic will actually unblock nodes in all possible situations, with all possible combinations of messages in the queue. I find it easier to reason about the "capacity goes negative" approach, as we allow at most one more message to be sent to unblock the node.
- With this change, we're re-purposing the batch flag as now there are strict requirements for it vs capacity. I don't really like that we're taking a flag that was only meant to be there to improve perf and make it part of correctness requirements. I think this will also cause confusion for the operators if they decide to change their configs. I think it'll be easier to communicate that whatever capacity they pick implies up to (capacity + max_message_size - 1) max capacity.
Let me know what you think, but I'd like to re-visit the currently implemented approach, as it seems easier to reason about
We can't apply "capacity is always positive" reasoning to reading capacity as we don't know what non-flood traffic a peer is sending us and it's not controlled with SEND_MORE messages (we only define max message size for flood traffic, like transactions).
Oh I see, I missed this - what is the point of the global in-bytes reading capacity?
I think the reason we want to track in bytes is to avoid network congestion, and this does not contribute to that.
This one is more for DoS protection then, right? If so, we should just have enough capacity for the largest message possible, with some buffer on top - so something on the order of MAX_MESSAGE_SIZE*5 (right now MAX_MESSAGE_SIZE is 16MB, btw).
I still find it hard to reason about the proposal and the relationship between capacity and batch size - specifically, I can't tell if this logic will actually unblock nodes in all possible situations, with all possible combinations of messages in the queue. I find it easier to reason about the "capacity goes negative" approach, as we allow at most one more message to be sent to unblock the node.
We went over this offline already.
I actually find it harder to reason about the case where we go negative, because we have implicit limits.
In the current version, for example, you have PEER_READING_CAPACITY_BYTES of about 100KB, but you can read 16MB of data.
Capacity and limits should be explicit so that we can properly reason about resource utilization. Right now, based on the settings, it looks like we would only use about 100KB of memory per connection, but this is way off.
The same thing applies to what is "on the wire": when modeling based on bandwidth numbers, you really want to configure based on the "real" numbers.
With this change, we're re-purposing the batch flag as now there are strict requirements for it vs capacity. I don't really like that we're taking a flag that was only meant to be there to improve perf and make it part of correctness requirements. I think this will also cause confusion for the operators if they decide to change their configs. I think it'll be easier to communicate that whatever capacity they pick implies up to (capacity + max_message_size - 1) max capacity.
Those settings were always "advanced" in a way, because what we really want to control with them is max latency. I don't think making additional checks on startup to ensure that people don't do bad things is that special.
I agree that we need to make configuration as simple as possible. We went back and forth on how to configure batch size in the original flow control PR (if you remember we had things expressed as a fraction of the capacity at some point).
This is related to my previous comment on what is easier for people to reason about. In the context of controlling latency, I think expressing it in terms of true bytes makes sense (because if people know the bandwidth they have, they can do the math on how to allocate it).
We could look into better ways to express those settings (but I don't think anybody is really going to try to configure those things).
I think I am not following the discussion in this thread. Can you briefly explain the steps in the 300-byte message example with the capacity-going-negative approach?
We are implementing traffic shaping, but in a different way. I think of this as a sort of bandwidth limit on top of the message count. Message size can be arbitrary, and we do not divide messages into multiple iterations. For example, say the capacity is 100, we have already received two messages of 40 and 50, and the next message will be 30. The receiver does not know the size beforehand but still needs to ask for more, because there are 10 bytes of available capacity; hence we accept the 30, go to -20, and stop asking for more messages.
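The 40/50/30 walkthrough above can be traced with a minimal receiver-side model (illustrative names only, not the real implementation): the receiver keeps asking for more data while any capacity remains, and stops once capacity is used up, possibly going negative by less than one message.

```cpp
#include <cassert>
#include <cstdint>

// Receiver-side view of the "capacity goes negative" scheme: keep asking
// for more while any capacity remains; stop once it is exhausted.
struct ReadingCapacity
{
    int64_t mCapacity;

    explicit ReadingCapacity(int64_t cap) : mCapacity(cap) {}

    // Request more traffic only while some capacity is still available.
    bool wantsMore() const { return mCapacity > 0; }

    // Message sizes are unknown ahead of time, so this may go negative.
    void onMessage(int64_t msgSizeBytes) { mCapacity -= msgSizeBytes; }
};
```

With capacity 100: after the 40- and 50-byte messages, 10 bytes remain, so the receiver still asks for more; the 30-byte message then takes capacity to -20 and no further requests are issued until capacity is released.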
I believe this PR serves its purpose for now. As a future improvement, we can introduce a burst size in addition to the capacity size, which may help represent this better.
I made some changes that should hopefully help unblock this thread:
- Capacity is now always positive, but there's a requirement for capacity vs maximum flood message size (what we discussed offline)
- Byte accounting does not apply to reading capacity (for now). This is for simplicity, as currently maximum message size is really big, so I'd like to think about the best way to protect against DoS with large messages. We can still move this change forward, and add reading capacity incrementally (it's a purely internal change, so can be added separately without a version bump)
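Under the revised invariant, the outbound side might be sketched like this (hypothetical names, illustrative only): capacity stays non-negative because a message is only sent when it fully fits, and requiring the configured capacity to be at least the maximum flood message size guarantees any single message can eventually be sent, which is what rules out the deadlock discussed earlier in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the revised "capacity stays non-negative" scheme with
// tryLock-style semantics: consume capacity only if the whole message fits.
struct NonNegativeCapacity
{
    uint64_t mCapacity = 0;

    // Returns false (and leaves capacity untouched) if the message does
    // not fit; the caller then waits for a SEND_MORE instead of going
    // negative.
    bool trySend(uint64_t msgSizeBytes)
    {
        if (msgSizeBytes > mCapacity)
        {
            return false;
        }
        mCapacity -= msgSizeBytes;
        return true;
    }

    void release(uint64_t numBytes) { mCapacity += numBytes; }
};
```

A 300-byte message against 100 bytes of capacity is simply deferred; once the peer has released at least 300 bytes in total, it goes out, and the configured-capacity >= max-flood-message-size requirement guarantees this always eventually happens.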
src/overlay/Peer.cpp
I am having a hard time understanding how we'd set this limit:
- is "10 MB" too high or too low?
- other limit is expressed relative to ledger size, which is easy to understand
- there is a lower bound that comes from business requirements
- the limit itself is trying to control for latency? something else?
Right now that limit seems too high, but I don't fully understand what we're trying to do yet and if a simple limit like this is enough.
The limit here is related to memory DoS:
- Not sure how we can reason about ledger size here, because we don't know the transaction size distribution per ledger. 1000 ops of 300 bytes is very different from 1000 ops of 128KB.
- I'm assuming transactions of arbitrary sizes can end up in this queue. Because the variance is too big, I instead decided to put a limit that is high enough that it'll still likely allow most/all transactions, while ensuring that in degenerate cases we don't blow up our memory (large txs add up fast with the current limit of 1000 messages: 128KB * 1000 = 128MB for just one peer).
- I'm not sure it's too high - we're still enforcing the 1000-txs-per-queue limit; this limit is just to prevent the queue from growing too big.
My concern here is that this limit is fine for protecting against out-of-memory DoS, but it may still trigger a lot of badness: I think what we discussed in a different context was maybe moving to latency-based queue policies that would avoid creating "spiral of death" situations. 10MB seems like a lot of data to move, especially if there are a lot of connections competing against each other (a 100-peers type of situation means 1GB of stuff queued up to be sent).
I'd like to move away from reasoning that ~100 peers is the norm. We have work in flight that actively tries to reduce this number by a lot. Even without it, a more realistic number based on pubnet data is ~50-60. That being said, I agree that we may still create too much overhead.
What I was thinking we could do is use the configs for "max classic ops" vs "max Soroban ops" per ledger to calculate this limit. The reason I didn't do it is because we don't have those configs yet, and we don't even know what the values would be. Hence the static limit for now. I think we can change it in the next iteration once we have a better idea about what the distribution is going to be like. In the meantime, I'll reduce the limit to something reasonable, like 3MB.
src/main/Config.h
nit: FLOW_CONTROL_SEND_MORE_BATCH_SIZE_BYTES bytes?
src/main/Config.cpp
forgot to update example config
But see the main conversation - we may want to pick different parameters to expose in the config (and derive the rest) if this is not easy to configure.
src/main/Config.cpp
are we planning to remove these constants in the future?
src/main/Config.cpp
what is the reason to change it from uint to int?
src/main/Config.h
Can these have negative values? Or do we want to keep them the same as all the others - int instead of uint?
```cpp
// Additional number of bytes to add to every payment operation in loadgen
std::vector<uint32> LOADGEN_OP_SIZE_FOR_TESTING;
std::vector<uint32> LOADGEN_OP_SIZE_DISTRIBUTION_FOR_TESTING;
```

```cpp
// processes `FLOW_CONTROL_SEND_MORE_BATCH_SIZE_BYTES` bytes
uint32_t FLOW_CONTROL_SEND_MORE_BATCH_SIZE_BYTES;
```

```cpp
// For testing only, remove later
```
```cpp
mCapacity.mFloodCapacity -= msgResources;
if (mCapacity.mFloodCapacity == 0)
{
    CLOG_DEBUG(Overlay, "No flood capacity for peer {}",
```
The msg is accepted (returns true) but it prints "No flood capacity"; you mean no flood capacity is left anymore, for the debug log?
```cpp
{
    if (mCapacity.mFloodCapacity == 0)
    {
        CLOG_DEBUG(Overlay, "Got flood capacity for peer {} ({})",
```
+1 on adding to both total capacity and flood capacity, and a debug log for when this will make flood capacity >= 0
```cpp
ZoneScoped;
releaseAssert(msg.type() == SEND_MORE || msg.type() == SEND_MORE_EXTENDED);
auto numMessages = Peer::getNumMessages(msg);
if (!hasOutboundCapacity(msg) && numMessages != 0)
```
do we use hasOutboundCapacity() somewhere else?
```cpp
CLOG_DEBUG(Overlay, "Got outbound capacity for peer {}",
           mApp.getConfig().toShortString(mNodeID));
}
mOutboundCapacity += msg.sendMoreExtendedMessage().numBytes;
```
Can you add a few lines of comments (or a brief description in the PR summary) regarding sendMoreExtendedMessage and releaseOutbound(), in both FlowControlCapacityBytes and FlowControlCapacity?
Updates to switch flow control capacity accounting to bytes instead of message counts. This allows us to better support larger transactions (e.g. Soroban).