Optimise WAL loading by caching min-time #8645

Closed

bboreham wants to merge 6 commits into prometheus:main from bboreham:cache-wal-min


Conversation

@bboreham (Member) commented Mar 25, 2021

Fixes #8638

Profiling shows that computing the last timestamp of mapped chunks inside append() takes ~10% of CPU time, so compute and store it once per series.

Edit: also fix some issues in the benchmark.
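The change can be sketched roughly as below. The types and field names here (`mmappedChunk`, `memSeries`, `mmMaxTime`) are illustrative stand-ins for the real Prometheus TSDB structs, not the exact code in this PR: the point is that the max time of the mapped chunks is computed once when the chunks are attached to the series, instead of on every `append()` call during WAL replay.

```go
package main

import (
	"fmt"
	"math"
)

// Illustrative stand-ins for the TSDB types; not the real struct definitions.
type mmappedChunk struct {
	minTime, maxTime int64
}

type memSeries struct {
	mmappedChunks []mmappedChunk
	mmMaxTime     int64 // cached max time of the mapped chunks (hypothetical field name)
}

// setMMappedChunks caches the last chunk's max time once, so appendSample
// no longer recomputes it for every sample replayed from the WAL.
func (s *memSeries) setMMappedChunks(chunks []mmappedChunk) {
	s.mmappedChunks = chunks
	s.mmMaxTime = math.MinInt64
	if len(chunks) > 0 {
		s.mmMaxTime = chunks[len(chunks)-1].maxTime
	}
}

// appendSample skips samples already covered by a mapped chunk.
func (s *memSeries) appendSample(t int64) bool {
	if t <= s.mmMaxTime { // cheap comparison instead of a slice lookup per sample
		return false
	}
	// real ingestion of the sample would happen here
	return true
}

func main() {
	s := &memSeries{}
	s.setMMappedChunks([]mmappedChunk{{0, 119}, {120, 239}})
	fmt.Println(s.appendSample(200), s.appendSample(240)) // false true
}
```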

@bboreham bboreham requested a review from codesome as a code owner March 25, 2021 11:42
@pracucci (Contributor) left a comment


The change makes sense to me. What does BenchmarkLoadWAL() say (e.g. running with -count=5 and then comparing with benchstat)?

@bboreham (Member, Author)

As far as I can see BenchmarkLoadWAL() sets up a WAL with no mmapped chunks, so the code I noted in the profile is not executed.

I think I need some advice on how to set up a benchmark with both WAL samples and mmapped chunks.

@stale stale bot added the stale label May 31, 2021
@stale stale bot removed the stale label Jul 17, 2021
@bboreham (Member, Author)

I spent some time drilling into the behaviour of the code and figured out how to create some mmapped chunks in BenchmarkLoadWAL().
When I profile the benchmark, much of the time is spent in the Go scheduler; is that expected?
I think I found one flaw in the benchmark; I'm interested to hear whether it has been closely examined before.

@codesome (Member)

I'm interested to hear whether it has been closely examined before.

Prolly not

bboreham added 6 commits July 20, 2021 11:35
So that goroutines are stopped and resources released

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Series are distributed across workers by taking the modulus of the
ID with the number of workers, so multiples of 100 are a poor choice.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Real Prometheus cuts chunks every 120 samples, then skips those samples
when re-reading the WAL. Simulate this by creating a single mapped chunk
for each series, since the max time is all the reader looks at.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Profiling shows that computing the last timestamp of mapped chunks
inside `append()` takes significant CPU time, so compute and store
it once per series.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
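The worker-distribution pitfall mentioned in the second commit message can be illustrated with a small sketch (`workerFor` is a hypothetical helper mirroring the modulus rule, not the actual WAL loader code): with 8 workers, series IDs that are all multiples of 100 map to only two workers, since 100k mod 8 is always 0 or 4.

```go
package main

import "fmt"

// workerFor mirrors the sharding rule in the WAL loader: series ref modulo
// the number of workers. The function name is illustrative.
func workerFor(ref uint64, nWorkers int) int {
	return int(ref % uint64(nWorkers))
}

// distinctWorkers counts how many workers ever receive a series when all
// series refs are multiples of step.
func distinctWorkers(step uint64, nWorkers, nSeries int) int {
	hit := map[int]bool{}
	for i := 1; i <= nSeries; i++ {
		hit[workerFor(uint64(i)*step, nWorkers)] = true
	}
	return len(hit)
}

func main() {
	// Refs in steps of 100 with 8 workers: only 2 workers get any series,
	// so the replay load is badly skewed.
	fmt.Println(distinctWorkers(100, 8, 1000)) // 2
	// A step coprime to the worker count spreads series over all 8 workers.
	fmt.Println(distinctWorkers(101, 8, 1000)) // 8
}
```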
@bboreham (Member, Author) commented Jul 20, 2021

I have added a new option to BenchmarkLoadWAL() to simulate mmapped chunks, and a few more commits to address some items noted at #9101.

Headline benchmark numbers say not much has changed:

name                                                                                                    old time/op    new time/op    delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-4         582ms ±20%     537ms ± 7%    ~     (p=0.222 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-4         866ms ±17%     814ms ±10%    ~     (p=0.310 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-4         329ms ± 2%     326ms ± 5%    ~     (p=0.690 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-4     3.26s ± 5%     3.21s ± 3%    ~     (p=0.421 n=5+5)

name                                                                                                    old alloc/op   new alloc/op   delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-4        52.1MB ± 0%    52.1MB ± 0%  +0.07%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-4         260MB ± 8%     254MB ± 7%    ~     (p=0.310 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-4        51.5MB ± 2%    51.9MB ± 3%    ~     (p=0.841 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-4     351MB ± 0%     361MB ± 1%  +2.69%  (p=0.008 n=5+5)

name                                                                                                    old allocs/op  new allocs/op  delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-4          586k ± 0%      586k ± 0%    ~     (p=0.421 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-4         2.49M ± 0%     2.49M ± 0%    ~     (p=0.310 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-4          487k ± 0%      487k ± 0%    ~     (p=0.905 n=5+4)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-4     3.73M ± 0%     3.73M ± 0%    ~     (p=0.841 n=5+5)

However, if we look at the amount of CPU used during the test with simulated mmapped chunks, we see about a 10% improvement, in line with what the profile showed at #8638:

before: 13.88user 2.82system 0:07.31elapsed 228%CPU (0avgtext+0avgdata 275464maxresident)k
after:  12.39user 2.26system 0:07.02elapsed 208%CPU (0avgtext+0avgdata 275956maxresident)k

Timings of loading a real 5GB WAL:

before: 161.75user 27.56system 1:05.66elapsed 288%CPU (0avgtext+0avgdata 3505748maxresident)k
after:  131.73user 26.46system 0:59.90elapsed 264%CPU (0avgtext+0avgdata 3804560maxresident)k

@cstyan (Member) left a comment


Can we leave a comment somewhere to the effect that skipping samples we know will be rejected in append() is the optimization you're making around lines 468/469? It's not immediately clear that caching minTime is that important without looking at the linked issue.

@bboreham bboreham marked this pull request as draft July 26, 2021 08:23
@bboreham (Member, Author)

I started to re-word the comment, then realised the whole basis for this cache is incorrect.

"Mitigate lock contention in getByID", it said, but the locks are sharded 32,768 ways so with any likely number of workers there is no contention.

Removing refSeries altogether makes my test go about as fast as adding minTime did. The real problem must be somewhere else.
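The 32,768-way sharding referred to above can be sketched as follows. `stripeSeries` here is a simplified model of Prometheus' striped series map, with placeholder value types: each stripe has its own lock, so with only a handful of WAL-replay workers the chance of two workers contending on the same stripe at once is negligible.

```go
package main

import (
	"fmt"
	"sync"
)

const stripeSize = 1 << 15 // 32768 stripes, as in the head's series map

// stripeSeries is a simplified model of the sharded series map: lookups for
// different refs almost always take different locks.
type stripeSeries struct {
	locks  [stripeSize]sync.Mutex
	series [stripeSize]map[uint64]string // value type is a placeholder
}

func newStripeSeries() *stripeSeries {
	s := &stripeSeries{}
	for i := range s.series {
		s.series[i] = map[uint64]string{}
	}
	return s
}

// getByID locks only the stripe that the ref hashes to.
func (s *stripeSeries) getByID(ref uint64) string {
	i := ref & (stripeSize - 1) // stripe index
	s.locks[i].Lock()
	defer s.locks[i].Unlock()
	return s.series[i][ref]
}

func (s *stripeSeries) set(ref uint64, v string) {
	i := ref & (stripeSize - 1)
	s.locks[i].Lock()
	s.series[i][ref] = v
	s.locks[i].Unlock()
}

func main() {
	s := newStripeSeries()
	s.set(42, "series-42")
	fmt.Println(s.getByID(42)) // series-42
}
```

With, say, 8 replay workers and 32,768 stripes, two workers rarely touch the same stripe simultaneously, which is why a per-worker refSeries cache to "mitigate lock contention" buys nothing.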

@cstyan (Member) commented Jul 27, 2021

Removing refSeries altogether makes my test go about as fast as adding minTime did. The real problem must be somewhere else.

Interesting.

@bboreham (Member, Author) commented Aug 5, 2021

Replaced by #9160

@bboreham bboreham closed this Aug 5, 2021


Development

Successfully merging this pull request may close these issues.

Idea to speed up WAL loading

5 participants