
Flow control in bytes#3607

Closed
marta-lokhova wants to merge 9 commits into stellar:master from marta-lokhova:flow_control_bytes_NOV

Conversation

@marta-lokhova
Contributor

@marta-lokhova marta-lokhova commented Nov 22, 2022

Updates to switch flow control capacity accounting to bytes instead of message counts. This allows us to support larger transactions (i.e. Soroban) better.

@marta-lokhova marta-lokhova force-pushed the flow_control_bytes_NOV branch 2 times, most recently from 6cbe869 to 8acb946 Compare November 23, 2022 17:55
@marta-lokhova marta-lokhova marked this pull request as ready for review November 29, 2022 23:29
@marta-lokhova marta-lokhova force-pushed the flow_control_bytes_NOV branch 3 times, most recently from d4e98d5 to 33c8262 Compare December 16, 2022 18:55
Contributor

will need to update the Visual Studio project file before merging

Contributor

Why do we need mFlowControlMessages->getOutboundCapacity() > 0 ?

Contributor Author

We don't; message outbound capacity is guaranteed to always be positive. (I initially added this check when I made outbound capacity int64 instead of uint64, but forgot to remove it.) Removed it.

Contributor

We should not allow going beyond capacity, i.e. the code should be

    auto capacity = getMsgResourceCount(msg);
    releaseAssert(mOutboundCapacity > capacity);
    mOutboundCapacity -= capacity;

Contributor

or follow tryLock semantics like in lockLocalCapacity

Contributor

this should be cleaned up as part of this refactor (I am not really sure why it's done like this in the first place):

  • lockLocalCapacity should be called tryLockLocalCapacity
  • we should not update capacity if we can't lock
  • capacity should never get negative (this points to not having a clean definition of what we're trying to enforce)

Contributor Author

Addressing both this and the other comment around tryLock: I don't believe the proposed approach is correct. I intentionally implemented it this way, because right now messages take up arbitrary capacity that the receiver does not know ahead of time.

Consider a simple scenario: the receiver sets its batch size to 100 bytes, and asks the sender for 100 bytes worth of traffic. The sender has the next SCP message to send of size 300 bytes. Unless we allow capacity to temporarily go negative, the system deadlocks. To unblock it, we allow going past capacity for up to (message_size - 1).

Note that we wouldn't need to do this if we implemented flow control based on uniformly sized batches (no need to go negative if we know max batch size ahead of time), but I'm not doing it in this iteration because (a) benefits of this approach are unclear and (b) seems like it'll add more complexity to the code since we'll need to deconstruct/reconstruct messages, reconcile when to request more data based on complete messages vs batches received, etc
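The "capacity may temporarily go negative" scheme described above can be sketched as follows. This is a hypothetical illustration, not stellar-core's actual FlowControl class: sending is allowed whenever capacity is strictly positive, so capacity can dip to at most -(message_size - 1) before sending blocks, which is what unblocks the 100-byte-capacity / 300-byte-message scenario.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of sender-side flood capacity accounting (illustrative names).
struct OutboundCapacitySketch
{
    int64_t mCapacity; // signed on purpose: may go temporarily negative

    bool canSend() const { return mCapacity > 0; }

    // Returns true if the message was "sent" and its size consumed.
    bool trySend(int64_t msgSizeBytes)
    {
        if (!canSend())
        {
            return false;
        }
        mCapacity -= msgSizeBytes; // may overshoot past zero
        return true;
    }

    // SEND_MORE from the peer releases capacity back.
    void onSendMore(int64_t numBytes) { mCapacity += numBytes; }
};
```

With 100 bytes of capacity, a 300-byte SCP message still goes out (capacity drops to -200), after which sending blocks until the peer releases capacity via SEND_MORE.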

Contributor

I am not sure I understand the deadlock comment, that sounds more like a limitation of the design that I imagine we can overcome? In your example, we could just wait to receive more "send more" messages until there is more than 300?
Accounting in bytes or in terms of messages (where we already send multiple messages until we use up capacity) should behave the same way: if a peer sends us too much data, that means it's not respecting flow control, right?

The reason I flagged this in the first place is that if you have 100 bytes of capacity left, we allow receiving a 100KB message (or whatever the max message size is), but we would block after receiving a small (200-byte) transaction, which means we'd allow ~2x the configured quota for large messages.

Contributor Author

@marta-lokhova marta-lokhova Dec 22, 2022

I am not sure I understand the deadlock comment, that sounds more like a limitation of the design that I imagine we can overcome? In your example, we could just wait to receive more "send more" messages until there is more than 300?

Send more is triggered when capacity is released. This means that the node needs to receive 100 bytes' worth of data before requesting more. If it instead requests more data arbitrarily, that defeats the purpose of flow control. I'm not sure I understand what you're proposing, but I think it's the same thing as the current implementation (except we shouldn't send arbitrary "send more"): in the end, the node receives 300 bytes when it only has 100 bytes of capacity to spare. This limitation comes from the fact that we transmit full messages, and we don't know their sizes ahead of time.

Contributor Author

I looked into this further after our chat, and even sketched the change, but still found several issues:

  • We can't apply "capacity is always positive" reasoning to reading capacity as we don't know what non-flood traffic a peer is sending us and it's not controlled with SEND_MORE messages (we only define max message size for flood traffic, like transactions).
  • I still find it hard to reason about the proposal and the relationship between capacity and batch size - specifically, I can't tell if this logic will actually unblock nodes in all possible situations, with all possible combinations of messages in the queue. I find it easier to reason about the "capacity goes negative" approach, as we allow at most one more message to be sent to unblock the node.
  • With this change, we're re-purposing the batch flag as now there are strict requirements for it vs capacity. I don't really like that we're taking a flag that was only meant to be there to improve perf and make it part of correctness requirements. I think this will also cause confusion for the operators if they decide to change their configs. I think it'll be easier to communicate that whatever capacity they pick implies up to (capacity + max_message_size - 1) max capacity.

Let me know what you think, but I'd like to re-visit the currently implemented approach, as it seems easier to reason about

Contributor

We can't apply "capacity is always positive" reasoning to reading capacity as we don't know what non-flood traffic a peer is sending us and it's not controlled with SEND_MORE messages (we only define max message size for flood traffic, like transactions).

Oh I see, I missed this - what is the point of the global in-bytes reading capacity?
I think the reason we want to track in bytes is to avoid network congestion, and this does not contribute to that.
This one is more for DoS protection then, right? If so, we should just have enough capacity for the largest message possible with some buffer on top - so something in the order of MAX_MESSAGE_SIZE*5 (right now MAX_MESSAGE_SIZE is 16MB, btw).

I still find it hard to reason about the proposal and the relationship between capacity and batch size - specifically, I can't tell if this logic will actually unblock nodes in all possible situations, with all possible combinations of messages in the queue. I find it easier to reason about the "capacity goes negative" approach, as we allow at most one more message to be sent to unblock the node.

We went over this offline already.

I actually find it harder to reason when we go negative because we have implicit limits.
In the current version for example, you have PEER_READING_CAPACITY_BYTES of about 100KB but you can read 16MB of data.

Capacity & limits should be explicit so that we can properly reason about resource utilization; right now, based on the settings, it looks like we would only use about 100KB of memory per connection, but this is way off.

Same thing applies to what is "on the wire": when modeling based on bandwidth numbers, you really want to configure based on the "real" numbers.

With this change, we're re-purposing the batch flag as now there are strict requirements for it vs capacity. I don't really like that we're taking a flag that was only meant to be there to improve perf and make it part of correctness requirements. I think this will also cause confusion for the operators if they decide to change their configs. I think it'll be easier to communicate that whatever capacity they pick implies up to (capacity + max_message_size - 1) max capacity.

Those settings were always "advanced" in a way, because what we really want to control with them is max latency. I don't think making additional checks on startup to ensure that people don't do bad things is that special.

I agree that we need to make configuration as simple as possible. We went back and forth on how to configure batch size in the original flow control PR (if you remember we had things expressed as a fraction of the capacity at some point).

This is related to my previous comment on what is easier to reason about for people, in the context of controlling latency, I think that expressing that in terms of true bytes makes sense (because if people know the bandwidth they have, they can do the math on how to allocate it).
We could look into better ways to express those settings (but I don't think anybody is really going to try to configure those things).


I think I am not following the discussion in this thread. Can you briefly explain the steps in the 300K-message example with the capacity-going-negative approach?


We are implementing traffic shaping, just in a different way: I think of this as a sort of bandwidth limit on top of the message count. Message size can be arbitrary, and we do not split messages across multiple iterations. For example, suppose capacity is 100 and we have already received two messages of 40 and 50 bytes, and the next message will be 30 bytes. The receiver does not know that size beforehand but still needs to ask for more because there are 10 bytes of capacity available; hence we accept the 30-byte message, capacity goes to -20, and we stop asking for more messages.
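The worked example above (capacity 100, then messages of 40, 50, and 30 bytes) can be traced with a minimal receiver-side sketch. Names are illustrative, not stellar-core's actual types: the receiver keeps requesting data while any capacity remains, and only stops once a message pushes it to zero or below.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of receiver-side capacity accounting (illustrative names).
struct ReceiverCapacitySketch
{
    int64_t mAvailable; // signed: the last accepted message may overshoot

    // Receiver keeps asking for more data while any capacity remains.
    bool wantsMore() const { return mAvailable > 0; }

    // Message sizes are unknown ahead of time; accept and deduct in full.
    void onMessage(int64_t sizeBytes) { mAvailable -= sizeBytes; }
};
```

Running the example: after 40 and 50 bytes, 10 bytes of capacity remain, so the receiver still asks for more; the 30-byte message then drives capacity to -20 and the requests stop.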


I believe this PR serves its purpose for now. As a future improvement, we could introduce a burst size in addition to the capacity size, which may represent this better.

Contributor Author

I made some changes that should hopefully help unblock this thread:

  • Capacity is now always positive, but there's a requirement for capacity vs maximum flood message size (what we discussed offline)
  • Byte accounting does not apply to reading capacity (for now). This is for simplicity, as currently maximum message size is really big, so I'd like to think about the best way to protect against DoS with large messages. We can still move this change forward, and add reading capacity incrementally (it's a purely internal change, so can be added separately without a version bump)
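The revised scheme in the first bullet can be sketched as follows, under stated assumptions: the names and the 300-byte constant are hypothetical, not stellar-core's actual config fields. Capacity never goes negative, which is safe only if the configured capacity is at least the maximum flood message size, so that requirement is checked up front.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Illustrative constant; the real maximum flood message size differs.
constexpr int64_t MAX_FLOOD_MSG_SIZE_BYTES = 300;

// Sketch of the "capacity is always positive" design (illustrative names).
struct NonNegativeCapacitySketch
{
    int64_t mCapacity;

    explicit NonNegativeCapacitySketch(int64_t configured)
        : mCapacity(configured)
    {
        // Enforced at startup: any single flood message must fit.
        if (configured < MAX_FLOOD_MSG_SIZE_BYTES)
        {
            throw std::invalid_argument("capacity below max flood msg size");
        }
    }

    // tryLock-style semantics: either the full message fits, or nothing
    // is consumed and the caller must wait for SEND_MORE.
    bool tryLock(int64_t msgSizeBytes)
    {
        if (msgSizeBytes > mCapacity)
        {
            return false;
        }
        mCapacity -= msgSizeBytes;
        return true;
    }
};
```

Because capacity always covers at least one maximum-size flood message, a failed `tryLock` can never deadlock the connection: the peer's next SEND_MORE restores enough room.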

Contributor

I am having a hard time understanding how we'd set this limit:

  • is "10 MB" too high or too low?
  • other limit is expressed relative to ledger size, which is easy to understand
  • there is a lower bound that comes from business requirements
  • the limit itself is trying to control for latency? something else?

Right now that limit seems too high, but I don't fully understand what we're trying to do yet and if a simple limit like this is enough.

Contributor Author

The limit here is related to memory DoS:

  • Not sure how we can reason about ledger size here, because we don't know transaction size distribution per ledger. 1000 ops of 300 bytes is very different from 1000 ops of 128kb.
  • I'm assuming transactions of arbitrary sizes can end up in this queue. Because the variance is too big, instead I decided to put a limit that is high enough in a sense that it'll still likely allow most/all transactions, but in some degenerate cases we don't blow up our memory (large txs add up fast with the current limit of 1000 - 128kb*1000=128MB per just one peer).
  • I'm not sure it's too high - we're still enforcing 1000 txs per queue limit, this limit is just to prevent the queue from growing too big.
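The back-of-the-envelope arithmetic in the second bullet can be written out explicitly (constant names here are illustrative): a 1000-message queue limit times a 128 KB worst-case transaction is about 128 MB queued for a single peer, which is what the byte-based cap is meant to prevent.

```cpp
#include <cassert>
#include <cstdint>

// Worst-case per-peer queue memory without a byte-based cap
// (illustrative constants matching the numbers in the comment above).
constexpr int64_t QUEUE_MSG_LIMIT = 1000;
constexpr int64_t WORST_CASE_TX_BYTES = 128 * 1024; // 128 KB
constexpr int64_t WORST_CASE_QUEUE_BYTES =
    QUEUE_MSG_LIMIT * WORST_CASE_TX_BYTES; // ~128 MB per peer
```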

Contributor

My concern here is that this limit is fine for protecting against out of memory DoS, but it may still trigger a lot of badness: I think what we discussed in a different context was maybe moving to latency based queue policies that would avoid creating "spiral of death" situations. 10MB seems like a lot of data to move especially if there are a lot of connections competing against each other (100 peers type of situation -> 1GB of stuff queued up to be sent)

Contributor Author

I'd like to move away from reasoning that ~100 peers is the norm. We have work in flight that actively tries to reduce this number by a lot. Even without it, a more realistic number based on pubnet data is ~50-60. That being said, I agree that we may still create too much overhead.

What I was thinking we could do is use the configs for "max classic ops" vs "max Soroban ops" per ledger to calculate this limit. The reason I didn't do it is because we don't have those configs yet, and we don't even know what the values would be. Hence the static limit for now. I think we can change it in the next iteration once we have a better idea about what the distribution is going to be like. In the meantime, I'll reduce the limit to something reasonable, like 3MB.

Contributor

nit: FLOW_CONTROL_SEND_MORE_BATCH_SIZE_BYTES bytes?

@marta-lokhova marta-lokhova force-pushed the flow_control_bytes_NOV branch 2 times, most recently from 3eb57ae to d7fdd1b Compare January 6, 2023 19:25
Contributor

forgot to update example config

Contributor

but see the main convo, we may want to pick different parameters exposed in config (and derive those) if this is not easy to configure


@mbsdf mbsdf self-requested a review January 19, 2023 21:31

are we planning to remove these constants in the future?


+951


+1


what is the reason to change it from uint to int?


Can these have negative values? Or do we want to keep it same as all others, int instead of uint?

@marta-lokhova marta-lokhova force-pushed the flow_control_bytes_NOV branch from d7fdd1b to b2f9455 Compare January 27, 2023 23:43
@marta-lokhova marta-lokhova force-pushed the flow_control_bytes_NOV branch from b2f9455 to dee35dc Compare January 30, 2023 20:23
@marta-lokhova marta-lokhova force-pushed the flow_control_bytes_NOV branch from f1ff8b7 to 42089cd Compare January 31, 2023 20:37
// Additional number of bytes to add to every payment operation in loadgen
std::vector<uint32> LOADGEN_OP_SIZE_FOR_TESTING;
std::vector<uint32> LOADGEN_OP_SIZE_DISTRIBUTION_FOR_TESTING;


+1

// processes `FLOW_CONTROL_SEND_MORE_BATCH_SIZE_BYTES` bytes
uint32_t FLOW_CONTROL_SEND_MORE_BATCH_SIZE_BYTES;

// For testing only, remove later

+1

mCapacity.mFloodCapacity -= msgResources;
if (mCapacity.mFloodCapacity == 0)
{
CLOG_DEBUG(Overlay, "No flood capacity for peer {}",

msg is accepted (returns true) but prints "No flood capacity" - do you mean no flood capacity is left anymore, for the debug log?

{
if (mCapacity.mFloodCapacity == 0)
{
CLOG_DEBUG(Overlay, "Got flood capacity for peer {} ({})",

+1 on adding to both total capacity and flood capacity, and on the debug log for when this will make flood capacity >= 0

ZoneScoped;
releaseAssert(msg.type() == SEND_MORE || msg.type() == SEND_MORE_EXTENDED);
auto numMessages = Peer::getNumMessages(msg);
if (!hasOutboundCapacity(msg) && numMessages != 0)

do we use hasOutboundCapacity() somewhere else?

CLOG_DEBUG(Overlay, "Got outbound capacity for peer {}",
mApp.getConfig().toShortString(mNodeID));
}
mOutboundCapacity += msg.sendMoreExtendedMessage().numBytes;

Can you add a few lines of comments (or a brief description in PR summary) regarding: sendMoreExtendedMessage + releaseOutbound() both in FlowControlCapacityBytes and FlowControlCapacity

@mbsdf mbsdf left a comment

LGTM so far

@marta-lokhova
Contributor Author

Closing this PR: the change has been split into #3737 and #3686


4 participants