Optimise WAL loading by caching min-time #8645

Closed

bboreham wants to merge 6 commits into prometheus:main from bboreham:cache-wal-min


Conversation

@bboreham (Member) commented Mar 25, 2021

Fixes #8638

Profiling shows that computing the last timestamp of mapped chunks inside append() takes ~10% of CPU time, so compute and store it once per series.

Edit: also fix some issues in the benchmark.
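The change can be sketched roughly as below. The types and field names here (`mmappedChunk`, `memSeries`, `mmMaxTime`) are illustrative stand-ins for the real Prometheus TSDB structs, not the exact code in this PR: the point is that the max time of the mapped chunks is computed once when the chunks are attached to the series, instead of on every `append()` call during WAL replay.

```go
package main

import (
	"fmt"
	"math"
)

// Illustrative stand-ins for the TSDB types; not the real struct definitions.
type mmappedChunk struct {
	minTime, maxTime int64
}

type memSeries struct {
	mmappedChunks []mmappedChunk
	mmMaxTime     int64 // cached max time of the mapped chunks (hypothetical field name)
}

// setMMappedChunks caches the last chunk's max time once, so appendSample
// no longer recomputes it for every sample replayed from the WAL.
func (s *memSeries) setMMappedChunks(chunks []mmappedChunk) {
	s.mmappedChunks = chunks
	s.mmMaxTime = math.MinInt64
	if len(chunks) > 0 {
		s.mmMaxTime = chunks[len(chunks)-1].maxTime
	}
}

// appendSample skips samples already covered by a mapped chunk.
func (s *memSeries) appendSample(t int64) bool {
	if t <= s.mmMaxTime { // cheap comparison instead of a slice lookup per sample
		return false
	}
	// real ingestion of the sample would happen here
	return true
}

func main() {
	s := &memSeries{}
	s.setMMappedChunks([]mmappedChunk{{0, 119}, {120, 239}})
	fmt.Println(s.appendSample(200), s.appendSample(240)) // false true
}
```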

@bboreham bboreham requested a review from codesome as a code owner March 25, 2021 11:42
@pracucci (Contributor) left a comment


The change makes sense to me. What does BenchmarkLoadWAL() say (e.g. running with -count=5 and then comparing with benchstat)?

@bboreham (Member, Author)

As far as I can see BenchmarkLoadWAL() sets up a WAL with no mmapped chunks, so the code I noted in the profile is not executed.

I think I need some advice on how to set up a benchmark with both WAL samples and mmapped chunks.

@stale stale bot added the stale label May 31, 2021
@stale stale bot removed the stale label Jul 17, 2021
@bboreham (Member, Author)

I spent some time drilling into the behaviour of the code and figured out how to create some mmapped chunks in BenchmarkLoadWAL().
When I profile the benchmark, much of the time is spent in the Go scheduler; is that expected?
I think I found one flaw in the benchmark; I'm interested to hear whether it has been closely examined before.

@codesome (Member)

I'm interested to hear whether it has been closely examined before.

Prolly not

bboreham added 6 commits July 20, 2021 11:35
So that goroutines are stopped and resources released

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Series are distributed across workers by taking the modulus of the
ID with the number of workers, so multiples of 100 are a poor choice.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Real Prometheus cuts chunks every 120 samples, then skips those samples
when re-reading the WAL. Simulate this by creating a single mapped chunk
for each series, since the max time is all the reader looks at.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Profiling shows that computing the last timestamp of mapped chunks
inside `append()` takes significant CPU time, so compute and store
it once per series.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
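The worker-distribution pitfall mentioned in the second commit message can be illustrated with a small sketch (`workerFor` is a hypothetical helper mirroring the modulus rule, not the actual WAL loader code): with 8 workers, series IDs that are all multiples of 100 map to only two workers, since 100k mod 8 is always 0 or 4.

```go
package main

import "fmt"

// workerFor mirrors the sharding rule in the WAL loader: series ref modulo
// the number of workers. The function name is illustrative.
func workerFor(ref uint64, nWorkers int) int {
	return int(ref % uint64(nWorkers))
}

// distinctWorkers counts how many workers ever receive a series when all
// series refs are multiples of step.
func distinctWorkers(step uint64, nWorkers, nSeries int) int {
	hit := map[int]bool{}
	for i := 1; i <= nSeries; i++ {
		hit[workerFor(uint64(i)*step, nWorkers)] = true
	}
	return len(hit)
}

func main() {
	// Refs in steps of 100 with 8 workers: only 2 workers get any series,
	// so the replay load is badly skewed.
	fmt.Println(distinctWorkers(100, 8, 1000)) // 2
	// A step coprime to the worker count spreads series over all 8 workers.
	fmt.Println(distinctWorkers(101, 8, 1000)) // 8
}
```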
@bboreham (Member, Author) commented Jul 20, 2021

I have added a new option to BenchmarkLoadWAL() to simulate mmapped chunks, and a few more commits to address some items noted at #9101.

Headline benchmark numbers say not much has changed:

name                                                                                                    old time/op    new time/op    delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-4         582ms ±20%     537ms ± 7%    ~     (p=0.222 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-4         866ms ±17%     814ms ±10%    ~     (p=0.310 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-4         329ms ± 2%     326ms ± 5%    ~     (p=0.690 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-4     3.26s ± 5%     3.21s ± 3%    ~     (p=0.421 n=5+5)

name                                                                                                    old alloc/op   new alloc/op   delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-4        52.1MB ± 0%    52.1MB ± 0%  +0.07%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-4         260MB ± 8%     254MB ± 7%    ~     (p=0.310 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-4        51.5MB ± 2%    51.9MB ± 3%    ~     (p=0.841 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-4     351MB ± 0%     361MB ± 1%  +2.69%  (p=0.008 n=5+5)

name                                                                                                    old allocs/op  new allocs/op  delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-4          586k ± 0%      586k ± 0%    ~     (p=0.421 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-4         2.49M ± 0%     2.49M ± 0%    ~     (p=0.310 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-4          487k ± 0%      487k ± 0%    ~     (p=0.905 n=5+4)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-4     3.73M ± 0%     3.73M ± 0%    ~     (p=0.841 n=5+5)

However, if we look at the amount of CPU used during the test with simulated mmapped chunks, we see about a 10% improvement, in line with what the profile showed at #8638:

before: 13.88user 2.82system 0:07.31elapsed 228%CPU (0avgtext+0avgdata 275464maxresident)k
after:  12.39user 2.26system 0:07.02elapsed 208%CPU (0avgtext+0avgdata 275956maxresident)k

Timings of loading a real 5GB WAL:

before: 161.75user 27.56system 1:05.66elapsed 288%CPU (0avgtext+0avgdata 3505748maxresident)k
after:  131.73user 26.46system 0:59.90elapsed 264%CPU (0avgtext+0avgdata 3804560maxresident)k

@cstyan (Member) left a comment


Can we leave a comment somewhere to the effect that skipping samples we know will be rejected in append() is the optimization you're making around lines 468/469? It's not immediately clear that caching minTime is that important without looking at the linked issue.

@bboreham bboreham marked this pull request as draft July 26, 2021 08:23
@bboreham (Member, Author)

I started to re-word the comment, then realised the whole basis for this cache is incorrect.

"Mitigate lock contention in getByID", it said, but the locks are sharded 32,768 ways so with any likely number of workers there is no contention.

Removing refSeries altogether makes my test go about as fast as adding minTime did. The real problem must be somewhere else.
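The 32,768-way sharding referred to above can be sketched as follows. `stripeSeries` here is a simplified model of Prometheus' striped series map, with placeholder value types: each stripe has its own lock, so with only a handful of WAL-replay workers the chance of two workers contending on the same stripe at once is negligible.

```go
package main

import (
	"fmt"
	"sync"
)

const stripeSize = 1 << 15 // 32768 stripes, as in the head's series map

// stripeSeries is a simplified model of the sharded series map: lookups for
// different refs almost always take different locks.
type stripeSeries struct {
	locks  [stripeSize]sync.Mutex
	series [stripeSize]map[uint64]string // value type is a placeholder
}

func newStripeSeries() *stripeSeries {
	s := &stripeSeries{}
	for i := range s.series {
		s.series[i] = map[uint64]string{}
	}
	return s
}

// getByID locks only the stripe that the ref hashes to.
func (s *stripeSeries) getByID(ref uint64) string {
	i := ref & (stripeSize - 1) // stripe index
	s.locks[i].Lock()
	defer s.locks[i].Unlock()
	return s.series[i][ref]
}

func (s *stripeSeries) set(ref uint64, v string) {
	i := ref & (stripeSize - 1)
	s.locks[i].Lock()
	s.series[i][ref] = v
	s.locks[i].Unlock()
}

func main() {
	s := newStripeSeries()
	s.set(42, "series-42")
	fmt.Println(s.getByID(42)) // series-42
}
```

With, say, 8 replay workers and 32,768 stripes, two workers rarely touch the same stripe simultaneously, which is why a per-worker refSeries cache to "mitigate lock contention" buys nothing.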

@cstyan (Member) commented Jul 27, 2021

Removing refSeries altogether makes my test go about as fast as adding minTime did. The real problem must be somewhere else.

Interesting.

@bboreham (Member, Author) commented Aug 5, 2021

Replaced by #9160

@bboreham bboreham closed this Aug 5, 2021


Development

Successfully merging this pull request may close these issues.

Idea to speed up WAL loading

5 participants