storage: Raft entry cache grows (inefficiently) to huge size #13231
Description
Each store has a cache for raft entries, sized at 16MB by default. On gamma, we see that one node at a time holds over a gigabyte of raft entries in memory. I believe that not all of these entries are in the cache, but they are all being kept alive by it: Replica.Entries allocates one large array of raftpb.Entry objects, so a reference to any one of them keeps the whole array alive.
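To illustrate the retention problem, here is a minimal, self-contained Go sketch. `Entry` is a hypothetical stand-in for raftpb.Entry, and `loadEntries` and `detach` are illustrative helpers, not functions from the codebase. It shows that a small sub-slice still points into the original backing array, while an explicit copy gets its own allocation.

```go
package main

import "fmt"

// Entry is a hypothetical stand-in for raftpb.Entry.
type Entry struct {
	Index uint64
	Data  []byte
}

// loadEntries simulates loading n entries into one contiguous array,
// the way a single large read would.
func loadEntries(n int) []Entry {
	ents := make([]Entry, n)
	for i := range ents {
		ents[i] = Entry{Index: uint64(i + 1), Data: make([]byte, 16)}
	}
	return ents
}

// detach copies the given entries into a freshly allocated slice, so the
// result no longer shares a backing array with the source slice.
// The Data payloads of the copied entries are still shared.
func detach(ents []Entry) []Entry {
	out := make([]Entry, len(ents))
	copy(out, ents)
	return out
}

func main() {
	all := loadEntries(1000)

	// A sub-slice shares the 1000-element backing array. As long as
	// shared is reachable, the garbage collector must keep the whole
	// array (and every Data payload it points to) alive.
	shared := all[:10]

	// A copy owns a 10-element array of its own.
	owned := detach(shared)

	fmt.Println(cap(shared), cap(owned)) // 1000 10
}
```

The copy costs an extra allocation per cached batch, which is the trade-off against the memory wasted by sibling array entries.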
Additionally, whenever we load a large array of entries, we try to add them all to the cache, inserting them one by one and then evicting all but the last 16MB. This is actually the bigger concern in the gamma cluster at this time: the process blocks the server for long enough that the node loses its leases and never makes any progress.
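One possible way to avoid the insert-then-evict churn is to work out up front which suffix of the loaded entries fits in the cache and insert only that. The sketch below is an assumption about how this could be done, not the fix that was adopted. `Entry` and `tailWithinBudget` are hypothetical names, and `len(Data)` is used as a rough stand-in for an entry's real size.

```go
package main

import "fmt"

// Entry is a hypothetical stand-in for raftpb.Entry.
type Entry struct {
	Index uint64
	Data  []byte
}

// tailWithinBudget returns the longest suffix of ents whose payload bytes
// sum to at most maxBytes. Entries before that suffix would be evicted
// immediately after insertion, so there is no point in inserting them.
func tailWithinBudget(ents []Entry, maxBytes int) []Entry {
	used := 0
	start := len(ents)
	for start > 0 {
		sz := len(ents[start-1].Data)
		if used+sz > maxBytes {
			break
		}
		used += sz
		start--
	}
	return ents[start:]
}

func main() {
	// Ten entries of 4 bytes each, with a 10-byte budget.
	ents := make([]Entry, 10)
	for i := range ents {
		ents[i] = Entry{Index: uint64(i + 1), Data: make([]byte, 4)}
	}

	keep := tailWithinBudget(ents, 10)
	fmt.Println(len(keep), keep[0].Index) // 2 9
}
```

This walks the slice once from the end and touches the cache only for entries that will survive, instead of doing work proportional to the full load.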
Why do we load this monolithic block of entries? Whenever a new node becomes leader, it loads all uncommitted entries to see if there are any config changes. This is the only place where we load raft entries without any chunking, and it would be easy to add chunking here.
In addition, we may want to change the raft.Entries method to break up the arrays it uses when adding entries to the cache, trading the overhead of more, smaller allocations against the memory wasted by sibling array entries.
Finally, this problem is also a result of our inability to throttle incoming raft entries. The log appears to keep growing beyond our ability to process it.
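The chunking suggested for the new leader's scan of uncommitted entries could look roughly like the following sketch. Everything here is an assumption made for illustration: `Entry`, `EntryType`, `countConfChanges`, and the `load` callback are hypothetical and do not mirror the real raft or storage APIs. The point is only that the scan needs to hold at most `chunkSize` entries at a time.

```go
package main

import (
	"errors"
	"fmt"
)

// EntryType and Entry are hypothetical stand-ins for the raftpb types.
type EntryType int

const (
	EntryNormal EntryType = iota
	EntryConfChange
)

type Entry struct {
	Index uint64
	Type  EntryType
}

// countConfChanges scans log indices [lo, hi) for config-change entries,
// loading at most chunkSize entries per call to load so that the whole
// range is never held in memory at once.
func countConfChanges(
	lo, hi, chunkSize uint64,
	load func(lo, hi uint64) ([]Entry, error),
) (int, error) {
	if chunkSize == 0 {
		return 0, errors.New("chunkSize must be positive")
	}
	n := 0
	for lo < hi {
		end := hi
		if hi-lo > chunkSize {
			end = lo + chunkSize
		}
		ents, err := load(lo, end)
		if err != nil {
			return 0, err
		}
		for _, e := range ents {
			if e.Type == EntryConfChange {
				n++
			}
		}
		lo = end // the previous chunk is now unreferenced and collectable
	}
	return n, nil
}

func main() {
	// An in-memory log with indices 1..10 and config changes at 3 and 8.
	log := make([]Entry, 10)
	for i := range log {
		log[i] = Entry{Index: uint64(i + 1)}
	}
	log[2].Type = EntryConfChange
	log[7].Type = EntryConfChange

	load := func(lo, hi uint64) ([]Entry, error) {
		return log[lo-1 : hi-1], nil
	}

	n, err := countConfChanges(1, 11, 4, load)
	fmt.Println(n, err) // 2 <nil>
}
```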