Description
Problem
When the server is restarted with a significant number of WAL files (significant relative to available memory), replay runs into OOM. In a constrained environment (e.g. 1 GB), even if the snapshot process itself never hits OOM, it is still possible to shut down the server, restart it, and watch replay struggle and eventually OOM. The process does not even reach the point of snapshotting (snapshotting is the comparison here because that is where we usually see OOM today); the issue appears to be loading all WAL files concurrently (relevant code below).
influxdb/influxdb3_wal/src/object_store.rs, lines 151 to 158 in be25c6f:

```rust
let mut replay_tasks = Vec::new();
for path in paths {
    let object_store = Arc::clone(&self.object_store);
    replay_tasks.push(tokio::spawn(get_contents(object_store, path)));
}

for wal_contents in replay_tasks {
    let wal_contents = wal_contents.await??;
```
Once the process reaches this state, the only recourse today is allocating more memory to it (e.g. moving from 1 GB to 2 GB). It would be better to expose a CLI parameter so that the user can control how many files are loaded concurrently.
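As a rough sketch, such a knob could look like the following. This is illustrative only: the flag name and the use of a clap-derived args struct are assumptions, not the actual influxdb3 CLI surface.

```rust
// Hypothetical sketch: flag name and struct are illustrative assumptions.
use clap::Parser;

#[derive(Debug, Parser)]
struct ServeArgs {
    /// Maximum number of WAL files to load concurrently during replay.
    /// When unset, replay is unbounded (matching current behaviour).
    #[clap(long = "wal-replay-concurrency-limit")]
    wal_replay_concurrency_limit: Option<usize>,
}
```

The value would then be threaded down to the WAL replay path shown below.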
Solution
Stream the files and load them with a concurrency limit (unbounded by default). If the user runs into OOM, the CLI parameter can be tweaked until the files load again. The default could be more conservative (say 20), but since most users will never hit this, unbounded is probably the better default.
```rust
// assumes `futures::StreamExt` is in scope for `map`/`buffered`/`next`
let stream = futures::stream::iter(paths);
let mut replay_tasks = stream
    .map(|path| {
        let object_store = Arc::clone(&self.object_store);
        async move { get_contents(object_store, path).await }
    })
    .buffered(wal_replay_concurrency_limit.unwrap_or(usize::MAX));

while let Some(wal_contents) = replay_tasks.next().await {
    let wal_contents = wal_contents?;
```

Alternatives considered
I cannot think of any other mechanism that could work around this issue without having a cap on concurrency.
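To illustrate why capping concurrency bounds peak memory, here is a minimal self-contained sketch of the same idea using plain std threads in place of tokio tasks and `buffered` (all names here are stand-ins, not the actual replay code): a pool of `limit` workers drains a queue of "WAL file" jobs, so at most `limit` loads are ever in flight no matter how many files exist.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Drains `paths` with at most `limit` jobs in flight at once and
/// returns the peak observed concurrency.
fn replay_with_limit(paths: Vec<String>, limit: usize) -> usize {
    let queue = Arc::new(Mutex::new(paths));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let max_in_flight = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..limit)
        .map(|_| {
            let queue = Arc::clone(&queue);
            let in_flight = Arc::clone(&in_flight);
            let max_in_flight = Arc::clone(&max_in_flight);
            thread::spawn(move || loop {
                // Pop the next "WAL file" to load, or stop when done.
                let path = match queue.lock().unwrap().pop() {
                    Some(p) => p,
                    None => break,
                };
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                max_in_flight.fetch_max(now, Ordering::SeqCst);
                // Stand-in for loading/deserializing the WAL file;
                // the real code would call get_contents(...) here.
                let _ = path;
                thread::sleep(Duration::from_millis(10));
                in_flight.fetch_sub(1, Ordering::SeqCst);
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    max_in_flight.load(Ordering::SeqCst)
}

fn main() {
    let paths: Vec<String> = (0..50).map(|i| format!("wal/{i:05}.wal")).collect();
    let peak = replay_with_limit(paths, 4);
    assert!(peak <= 4);
    println!("peak concurrency: {peak}");
}
```

The `buffered(n)` stream in the proposed fix achieves the same bound cooperatively on the tokio runtime instead of with OS threads, which is why a single CLI knob is enough to cap replay memory.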