What would you like to be added?
All requests made to etcd are serialized into raft entry protos and persisted in the on-disk WAL. That's good, but to allow slow or disconnected members to catch up, etcd also keeps the last 10,000 entries in raft.InMemoryStorage, all loaded into memory. In some cases this can cause huge memory bloat in etcd. Imagine you have a sequence of large put requests (for example, 1MB configmaps in Kubernetes). etcd will keep all 10GB in memory, doing nothing with it.
This can be reproduced by running ./bin/tools/benchmark put --total=1000 --val-size=1000000 and collecting an inuse_space heap profile.
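
To make the numbers above concrete, here is a rough back-of-the-envelope sketch in Go. The function name `retainedBytes` and its parameters are made up for illustration and are not part of etcd; it only counts payload bytes and ignores per-entry overhead, so treat the results as a lower bound.

```go
package main

import "fmt"

// retainedBytes estimates the payload bytes kept in memory, assuming the
// in-memory storage holds at most `window` of the most recent entries and
// every entry carries roughly `entrySize` bytes. Hypothetical helper, not
// an etcd API.
func retainedBytes(written, window, entrySize uint64) uint64 {
	n := written
	if n > window {
		n = window // older entries have already been compacted away
	}
	return n * entrySize
}

func main() {
	const window = 10000 // retained entries, as described above

	// The benchmark run: 1,000 puts of 1,000,000 bytes each.
	fmt.Println(retainedBytes(1000, window, 1000000))

	// A sustained stream of ~1MB puts that fills the whole window.
	fmt.Println(retainedBytes(50000, window, 1000000))
}
```

Under these assumptions the benchmark run alone pins about 1GB, and a workload that fills the whole window reaches the 10GB mentioned above.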

The mechanism is really dumb and could benefit from the following improvements:
- Compact raft.InMemoryStorage after every apply instead of once every snapshot (10,000 entries is the default snapshot frequency). With the removal of v2storage we can switch to using the applied index instead of the snapshot index. As the applied index is updated more frequently, we can execute Compact more frequently, possibly after every apply, assuming that it's not too costly.
- Compact raft.InMemoryStorage based on the state of the slowest member. Why keep 5,000 entries (the default catch-up entries) in a 1-node cluster, or when all members are up to date? We could read the state of the slowest member and Compact based on that.
- Tune the default snapshot catch-up entries (5,000 entries). The current value is based on 1ms latency and 10k throughput, see https://github.com/etcd-io/etcd/pull/2403/files. It would be good to revisit this and tune it, for example by comparing catch-up times and availability for different sizes of WAL entries, DB file sizes, latencies, network throughput, etc.
- Change the semantics of catchup-entries from "we always store at least X entries" to "we store entries only for members that are behind by at most X entries". If a member is behind by more than X entries, it no longer makes sense to use raft entries to sync it; in that case we will use a snapshot. So why keep those entries?
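
The second and fourth ideas above can be combined into a single policy for picking the compaction point. The following is a minimal sketch under stated assumptions, not etcd's or raft's actual code: `compactIndex`, `matched` (the highest log index each member is known to have) and `catchUp` are hypothetical names, and how the leader would obtain per-member progress is left open.

```go
package main

import "fmt"

// compactIndex returns the highest log index that could be dropped from the
// in-memory storage. Hypothetical helper for illustration only.
//
//   - Nothing beyond the applied index is ever dropped.
//   - A member that is behind by at most catchUp entries keeps the entries
//     it still needs (everything after its matched index).
//   - A member that is further behind is ignored, on the assumption that it
//     will be synced with a snapshot instead of raft entries.
func compactIndex(applied uint64, matched []uint64, catchUp uint64) uint64 {
	compact := applied

	// Members whose matched index is below this floor are "too far behind".
	var floor uint64
	if applied > catchUp {
		floor = applied - catchUp
	}

	for _, m := range matched {
		if m < floor {
			continue // will receive a snapshot, so keep nothing for it
		}
		if m < compact {
			compact = m // slowest member that is still worth catching up
		}
	}
	return compact
}

func main() {
	// Single-member cluster: everything applied can be compacted.
	fmt.Println(compactIndex(20000, []uint64{20000}, 5000))

	// One member slightly behind, one far behind (snapshot candidate).
	fmt.Println(compactIndex(20000, []uint64{20000, 19990, 3000}, 5000))
}
```

With this policy a 1-node cluster, or a cluster where all members are up to date, would retain no extra entries, while a slightly lagging member still finds the entries it needs. Whether calling Compact this often is cheap enough is exactly the open question raised in the first item.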
Why is this needed?
Prevent etcd memory bloating and make memory usage more predictable.