fix(wal_replay): limit the number of wal files preloaded to num_cpus by philjb · Pull Request #26716 · influxdata/influxdb

philjb · 2025-08-21T20:59:20Z

WAL replay currently loads all the WAL files into memory and decodes them by default. If that's tens or hundreds of GB, it will still try to do so, potentially causing an OOM when the data exceeds system memory. We likely keep most of the speed from preloading but decrease the chance of an OOM by preloading a more limited number of WAL files. In the absence of an option to directly limit the memory used in preload, we can use the number of CPU cores available as a proxy. This will be the number of WAL files preloaded for replay, which still has to happen in order. The current recommendation is to use 10 if you encounter an OOM, so let's use that as the minimum when a specific value isn't set. The new logic is:

num_files_preloaded = (user's choice), or if not set, max(10, num_cpus)

This should improve the experience of restarting the server when there is a lot of WAL data.
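The description above can be sketched in a few lines of Rust. This is a minimal illustration, not the code from the PR: `resolve_preload_limit`, `load_and_decode`, and `replay_in_order` are hypothetical names, and the batch-by-batch strategy is only one possible way to bound how many decoded files sit in memory while still replaying in order.

```rust
use std::thread;

/// Floor from the PR description: 10 is the value recommended after an OOM.
const MIN_PRELOAD: usize = 10;

/// num_files_preloaded = user's choice, or max(10, num_cpus) when unset.
fn resolve_preload_limit(user_choice: Option<usize>, num_cpus: usize) -> usize {
    user_choice.unwrap_or_else(|| num_cpus.max(MIN_PRELOAD))
}

/// Hypothetical stand-in for reading and decoding one WAL file.
fn load_and_decode(file_id: u64) -> Vec<u64> {
    vec![file_id]
}

/// Preloads at most `limit` files at a time, then replays each batch in file order.
fn replay_in_order(file_ids: &[u64], limit: usize) -> Vec<u64> {
    let mut replayed = Vec::new();
    // `chunks` panics on a size of 0, so clamp the limit to at least 1.
    for batch in file_ids.chunks(limit.max(1)) {
        // Decode the batch concurrently; only this batch is held in memory.
        let decoded: Vec<Vec<u64>> = thread::scope(|s| {
            let handles: Vec<_> = batch
                .iter()
                .map(|&id| s.spawn(move || load_and_decode(id)))
                .collect();
            // Joining in spawn order preserves the original file order.
            handles
                .into_iter()
                .map(|h| h.join().expect("decode thread panicked"))
                .collect()
        });
        for contents in decoded {
            replayed.extend(contents);
        }
    }
    replayed
}
```

With no user setting on a 4-core machine, `resolve_preload_limit(None, 4)` yields the floor of 10, while an explicit `Some(3)` is honored as given, even though it is below the floor.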

closes [v3][core] Set default for wal-replay-concurrency-limit to number of cpu cores instead of unlimited #26715

hiltontj

Looks good! I have a suggestion that you can take or leave.

The only thing I would do differently is have the default set in the clap block here:

influxdb/influxdb3/src/commands/serve.rs

Lines 521 to 525 in ec343b9

    
               #[clap( 
        
                   long = "wal-replay-concurrency-limit", 
        
                   env = "INFLUXDB3_WAL_REPLAY_CONCURRENCY_LIMIT" 
        
               )] 
        
               pub wal_replay_concurrency_limit: Option<usize>,

That way, replay could take a usize instead of Option<usize>, but more importantly, the default setting is visible at the serve command level, instead of nested down in the influxdb3_wal crate there.

hiltontj · 2025-08-22T01:06:26Z

influxdb3_wal/Cargo.toml

-bytes.workspace = true
 byteorder.workspace = true
+bytes.workspace = true
+clap = "4.5.23"


Suggested change

-clap = "4.5.23"
+clap.workspace = true

yes of course. done.

hiltontj · 2025-08-22T01:14:00Z

influxdb3/src/help/serve_all.txt

  --wal-replay-concurrency-limit <LIMIT>
-                                  Concurrency limit during WAL replay [default: no_limit]
+                                  Concurrency limit during WAL replay [default: max(num_cpus, 10)]
                                  If replay runs into OOM, set this to a lower number eg. 10


Suggested change

-                                  If replay runs into OOM, set this to a lower number eg. 10
+                                  Setting this number too high can lead to OOM.

Or something to that effect. I think the previous statement no longer applies if we are setting a reasonable default.

good idea. done.

philjb

suggestions implemented!

philjb · 2025-08-22T21:34:10Z

influxdb3_wal/Cargo.toml

yes of course. done.

philjb · 2025-08-22T21:34:55Z

influxdb3/src/help/serve_all.txt

good idea. done.

philjb · 2025-08-22T22:10:42Z

influxdb3/src/commands/serve.rs

 )];

+const MIN_REPLAY_PRELOAD_CONCURRENCY: usize = 10; // the min number of files that will be held in memory
+fn wal_replay_concurrency_limit_default() -> String {


I didn't know the clap default value macro would take a function call, but it does, so this works nicely to dynamically determine the default value. Thanks for the suggestion.

But of course we only print the curated help file, so in the current setup we can't actually get clap to print the calculated value in the CLI help, which is kind of a shame.

But I tested it by commenting out the maybe_print_help() method, and it works.

Perhaps we can bring back the help subcommand which will still work with our curated help but will display all the clap generated help too.

influxdb/influxdb3/src/lib.rs

Line 51 in 03dc8c6

disable_help_subcommand = true,
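The diff above cuts off right after the signature of wal_replay_concurrency_limit_default(), so its body is not visible in this excerpt. A minimal sketch of what such a helper could look like follows. The constant name and the function signature are taken from the diff; the body is an assumption, and `std::thread::available_parallelism` stands in for whatever CPU-count lookup the PR actually uses.

```rust
use std::thread;

// The min number of files that will be held in memory (name taken from the diff).
const MIN_REPLAY_PRELOAD_CONCURRENCY: usize = 10;

/// Computes max(num_cpus, 10) at startup and returns it as a string, matching
/// the `-> String` signature shown in the diff.
fn wal_replay_concurrency_limit_default() -> String {
    // Assumption: the standard library's parallelism estimate is used as the
    // CPU count, falling back to 1 if it cannot be determined.
    let num_cpus = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    num_cpus.max(MIN_REPLAY_PRELOAD_CONCURRENCY).to_string()
}
```

Because the value is computed at runtime, a static help file can only describe it as a formula such as max(num_cpus, 10), which matches the wording adopted in serve_all.txt.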

* chore(wal_replay): port limit concurrency of replay
* port #26716 to ent
* ticket #26715
* chore(cli): port hide the token's env value in cli help
* port #26714
* ticket #26713

philjb added the v3 label Aug 21, 2025

chore: update the default value in help-all

bb4ffe2

hiltontj approved these changes Aug 22, 2025

philjb added 2 commits August 22, 2025 14:58

fix(wal_replay): implement dynamic default value in clap derive for c…

467a5fc

…oncurrency limit - Add wal_replay_concurrency_limit_default() helper fn for max(num_cpus, 10) - Change field type from Option<usize> to usize - Update help text to clarify dynamic nature and OOM warning

chore: typo in help

03dc8c6

philjb commented Aug 22, 2025

philjb marked this pull request as ready for review August 22, 2025 22:58

philjb requested a review from hiltontj August 22, 2025 22:59

hiltontj approved these changes Aug 23, 2025

philjb merged commit f9c8e0a into main Aug 25, 2025
12 checks passed

philjb deleted the pjb/26715/limit-wal-replay-loading-to-cpus branch August 25, 2025 16:45

philjb merged 4 commits into main from pjb/26715/limit-wal-replay-loading-to-cpus
