[v3][core] Set default for wal-replay-concurrency-limit to number of cpu cores instead of unlimited #26715

@philjb

Description

With some fairly heavy ingest, I've been able to easily OOM Core when restarting it, because WAL replay loads all the WAL files concurrently and uses tens of GB of memory. As the help text for --wal-replay-concurrency-limit mentions, setting this limit to a value of 10 usually resolves the issue, as it did for me.

I argue that the default value of unlimited (usize::MAX) is too aggressive. The system's concurrency is limited in any case, whether by the CPUs servicing I/O operations, by the pipe to the file system (disk or object store), or by the memory available to hold the contents of WAL files. By default we should balance replay speed (roughly the concurrency level) against preventing OOMs.

Adjusting the concurrency is an indirect way to control memory usage. A direct setting for limiting memory usage during replay would be ideal, but in its absence I think the default concurrency should be a small multiple of the number of available cores (0.5x, 1x, 2x). Here I recommend 1x with a minimum of 10. This gives decent preloading on small systems while matching our current recommendation; on large systems, which likely have more memory, there can be more preloading.
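The proposed default could be sketched like this. This is an illustrative snippet, not the actual patch; `default_replay_concurrency` is a hypothetical helper name, and it uses the standard library's `available_parallelism` as a stand-in for `num_cpus::get()`:

```rust
// Illustrative sketch of the proposed default (not the actual patch):
// cap WAL replay concurrency at the number of CPU cores, with a floor
// of 10 to keep some preloading on small machines.
use std::thread::available_parallelism;

// Hypothetical helper; the real code would feed this value in as the
// default for --wal-replay-concurrency-limit.
fn default_replay_concurrency() -> usize {
    // Standard-library equivalent of num_cpus::get().
    let cores = available_parallelism().map(|n| n.get()).unwrap_or(1);
    cores.max(10)
}

fn main() {
    println!("default concurrency: {}", default_replay_concurrency());
}
```

With such a default, `paths.chunks(default_replay_concurrency())` would bound the number of WAL files held in memory at once to roughly the core count.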

This would create a default out-of-the-box configuration that is likely to succeed in most cases, even if it is slower than it could be. I think users (myself included) prefer functionality over speed. Users who need faster WAL replay will need to tune settings anyway, and can then take their own data patterns and hardware into account.

Here's the code in question:

```rust
// Load N files concurrently and then replay them immediately before loading the next batch
// of N files. Since replaying has to happen _in order_ only loading the files part is
// concurrent, replaying the WAL file itself is done sequentially based on the original
// order (i.e paths, which is already sorted)
for batched in paths.chunks(concurrency_limit.unwrap_or(usize::MAX)) {
```

As the comment notes, only the loading is concurrent -- the files still have to be replayed in order. Loading entails reading the whole WAL file's bytes into memory, deserializing them, and then holding onto them until the in-order replay can occur. It would be better to pipeline loading and replay, but without doing that refactor I think we can keep the replay speed gains from #25643 while limiting OOMs by using num_cpus::get().
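To illustrate the pipelining alternative mentioned above, here is a minimal sketch under my own assumptions (again not the actual InfluxDB code; `pipeline_replay` is a hypothetical name): a loader thread reads files ahead while the caller replays them strictly in path order, and a bounded channel caps how many loaded-but-unreplayed files sit in memory at once.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical sketch: load files on a background thread and replay
/// them in order on the caller's thread. The bounded channel caps how
/// many loaded files can wait in memory, so memory stays bounded even
/// though loading runs ahead of replay. Returns the replay order.
fn pipeline_replay(paths: Vec<String>, limit: usize) -> Vec<String> {
    let (tx, rx) = sync_channel::<(String, Vec<u8>)>(limit);
    let loader = thread::spawn(move || {
        for path in paths {
            let bytes = vec![0u8; 8]; // stand-in for reading the WAL file
            tx.send((path, bytes)).unwrap(); // blocks once `limit` files are queued
        }
    });
    let mut replayed = Vec::new();
    // The channel is FIFO, so replay stays strictly in path order.
    for (path, _bytes) in rx {
        replayed.push(path); // stand-in for replaying the file contents
    }
    loader.join().unwrap();
    replayed
}

fn main() {
    let paths: Vec<String> = (0..8).map(|i| format!("wal/{i:04}.wal")).collect();
    let order = pipeline_replay(paths.clone(), 3);
    assert_eq!(order, paths); // replay order matches the sorted path order
}
```

Compared with the chunked approach, this keeps the loader busy continuously instead of stalling at each batch boundary, while holding at most `limit + 1` files in flight.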

Related
