Description
With some fairly heavy ingest, I've been able to easily OOM Core when restarting it, because WAL replay loads all the WAL files concurrently and uses tens of GB of memory. As the help for --wal-replay-concurrency-limit mentions, setting this limit to a value of 10 usually resolves the issue, as it does for me too.
I argue that the default value of unlimited (usize::MAX) is too aggressive. The system's effective concurrency is bounded somewhere: by the CPUs servicing IO operations, by the pipe to the file system (disk or object store), or by the memory available to hold the contents of WAL files. The default should balance replay speed (roughly proportional to the concurrency level) against preventing OOMs.
Adjusting the concurrency is an indirect way to control memory usage. A direct setting that limits memory usage during replay would be ideal, but in the absence of that, I think the default concurrency should be a small multiple of the number of cores available (0.5x, 1x, 2x). In this case, I recommend 1x with a minimum of 10. This gets decent preloading on small systems while matching our current recommendation; on large systems, there can be more preloading where there is likely more memory too.
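As a sketch of the proposed default (using std's `available_parallelism` as a stand-in for the `num_cpus::get()` call mentioned below; the function name here is hypothetical):

```rust
use std::num::NonZeroUsize;

/// Proposed default: one in-flight WAL load per core, with a floor of 10
/// so small machines still get some preloading.
fn default_wal_replay_concurrency() -> usize {
    let cores = std::thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    cores.max(10)
}
```

The explicit `--wal-replay-concurrency-limit` flag would still override this; only the fallback when the flag is unset would change.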
This would give an out-of-the-box default configuration that is likely to succeed in most cases, even if it's slower than it could be. I think users (myself included) prefer functionality over speed. Users who need faster WAL replay will need to tune settings anyway, and can then take their own data patterns and hardware into account.
Here's the code in question:
influxdb/influxdb3_wal/src/object_store.rs
Lines 179 to 183 in d396921
```rust
// Load N files concurrently and then replay them immediately before loading the next batch
// of N files. Since replaying has to happen _in order_ only loading the files part is
// concurrent, replaying the WAL file itself is done sequentially based on the original
// order (i.e paths, which is already sorted)
for batched in paths.chunks(concurrency_limit.unwrap_or(usize::MAX)) {
```
As the comment notes, only the loading is concurrent; the files still have to be replayed in order. Loading entails reading a whole WAL file's bytes into memory, deserializing them, and then holding onto the result until the in-order replay can occur. It'd be better to pipeline loading and replay, but even without that refactor I think we can keep the replay speed from #25643 while limiting OOMs by defaulting to num_cpus::get().
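For reference, the pipelined variant doesn't require much machinery. A minimal sketch using only std threads and a bounded channel (the `load`/`replay` stand-ins and function names are hypothetical, not the actual influxdb3 types): the channel capacity caps how many loaded-but-unreplayed files are held in memory, while the consumer still replays strictly in the original sorted order.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Stand-in for reading a WAL file's bytes into memory.
fn load(path: String) -> Vec<u8> {
    path.into_bytes()
}

// Returns the replayed contents in order (a real implementation would
// apply each file's ops instead of collecting them).
fn replay_pipelined(paths: Vec<String>, limit: usize) -> Vec<Vec<u8>> {
    // Bounded channel of join handles: roughly `limit` loads in flight,
    // so memory for loaded-but-unreplayed files is capped.
    let (tx, rx) = sync_channel(limit);
    let producer = thread::spawn(move || {
        for p in paths {
            let handle = thread::spawn(move || load(p));
            // Blocks once `limit` loads are pending: backpressure.
            tx.send(handle).unwrap();
        }
    });
    let mut replayed = Vec::new();
    for handle in rx {
        // Handles arrive in spawn order, so replay stays in-order.
        replayed.push(handle.join().unwrap());
    }
    producer.join().unwrap();
    replayed
}
```

Unlike the batch-of-N loop above, this keeps loading the next files while earlier ones are being replayed, so a slow replay step doesn't idle the loaders.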
related
- feat: add concurrency limit for WAL replay #26483
- feat: Significantly decrease startup times for WAL #25643 and Make WAL load files in parallel on startup #25534
- `--table-index-cache-concurrency-limit` default is currently 20
- Currently DataFusion's runtime thread count is set to the number of cores available via num_cpus if not configured differently.
- Introduce CLI arg to control concurrency limit when loading snapshots #26498
- internal slack convo with @praveen-influx