kvserver: unbounded memory use when falling behind on sideloaded MsgApp #73376
Description
In #71802 (comment), we are seeing occasional failures due to nodes running out of memory. The heap profiles show large amounts of memory allocated while loading sideloaded SSTs into memory for appending to followers. Each individual raft leader will pull only ~one SST per append (due to our 32kb max-append-size target), but it may do so for each follower, so for every leader in the system we can expect at most num_followers * sst_size bytes to be pulled into memory per raft cycle. Unfortunately, outgoing messages are buffered, so even a single group could in theory put up to 10k SSTs into memory at once.
We don't have a single group but potentially tens of thousands of them, and in theory each of them can do the above (though they all share the 10k message limit, beyond which messages are dropped wholesale). In practice, the quota pool should, on each leader, prevent too many SSTs from entering the raft layer before they have been fully distributed to the followers. The quota pool size is half the raft log truncation threshold, which is 16mb, i.e. an 8mb proposal quota. So, assuming SSTs no larger than 8mb, we expect at most 8mb * num_followers in flight at any given time, per local raft leader.
Here we saw the heap profile track 2.11GiB. Unfortunately, the artifacts are no longer available, and even with them it might be difficult to tell whether we were dealing with a small number of extraordinarily large SSTs or a homogeneous flood of reasonably sized SSTs. Still, investigating another occurrence would be helpful, in particular with an eye toward when during the restore the problem occurs.
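The back-of-the-envelope bound above can be sketched in Go. The threshold and quota values come from the issue text; `numFollowers` is an illustrative assumption (a typical 3x-replicated range has two followers):

```go
package main

import "fmt"

func main() {
	const (
		// Raft log truncation threshold per the issue text: 16 MiB.
		raftLogTruncationThreshold = 16 << 20
		// Proposal quota is half the truncation threshold: 8 MiB.
		proposalQuota = raftLogTruncationThreshold / 2
		// Illustrative assumption: a 3x-replicated range has 2 followers.
		numFollowers = 2
	)
	// Worst case per leader, assuming no single SST exceeds the quota:
	// at most the full proposal quota in flight to each follower.
	maxInFlightPerLeader := proposalQuota * numFollowers
	fmt.Printf("max in-flight sideloaded bytes per leader: %d MiB\n",
		maxInFlightPerLeader>>20)
}
```

This bound is per leader; with tens of thousands of leaseholders on a node, the aggregate can still be large, which is consistent with the 2.11GiB observation.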
Action items
- add a histogram of raft append sizes (making sure it lets us distinguish between a few large messages and many reasonably sized ones)
- switch the queuing here from cardinality-based to message-size-based, and selectively drop messages that don't fit into the queue (how to size the queue remains an open question)
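A minimal sketch of the second action item, with hypothetical names (`msg`, `byteQueue`, `tryEnqueue` are not CockroachDB's actual types): instead of capping the send queue at a fixed message count, cap it by total payload bytes and reject messages that would exceed the budget, relying on raft's retransmission to recover dropped appends:

```go
package main

import "fmt"

// msg stands in for an outgoing raft message with a payload size in
// bytes (hypothetical type, not CockroachDB's raftpb.Message).
type msg struct {
	size int64
}

// byteQueue bounds the outgoing send queue by total payload bytes
// rather than by message count.
type byteQueue struct {
	maxBytes int64
	curBytes int64
	msgs     []msg
}

// tryEnqueue appends m if it fits within the byte budget and reports
// whether it was accepted; callers drop rejected messages, and raft
// retransmits them later.
func (q *byteQueue) tryEnqueue(m msg) bool {
	if q.curBytes+m.size > q.maxBytes {
		return false // selectively drop this message
	}
	q.msgs = append(q.msgs, m)
	q.curBytes += m.size
	return true
}

// dequeue pops the oldest message, releasing its share of the budget.
func (q *byteQueue) dequeue() (msg, bool) {
	if len(q.msgs) == 0 {
		return msg{}, false
	}
	m := q.msgs[0]
	q.msgs = q.msgs[1:]
	q.curBytes -= m.size
	return m, true
}

func main() {
	q := &byteQueue{maxBytes: 32 << 20} // e.g. a 32 MiB budget
	fmt.Println(q.tryEnqueue(msg{size: 20 << 20})) // fits
	fmt.Println(q.tryEnqueue(msg{size: 20 << 20})) // would exceed budget
	q.dequeue()
	fmt.Println(q.tryEnqueue(msg{size: 20 << 20})) // fits again
}
```

How to choose `maxBytes` (a fixed budget? proportional to the quota pool? shared across ranges?) is exactly the open question noted above; the sketch only shows the accounting mechanism.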
Jira issue: CRDB-11564