Optimise WAL loading by removing extra map and caching min-time #9160

Merged
codesome merged 10 commits into prometheus:main from bboreham:cache-wal-min-3 on Aug 10, 2021

Conversation


@bboreham bboreham commented Aug 5, 2021

This PR replaces #8645; it includes the first four commits which address benchmark issues at #9101.
Fixes #8638.

Two code changes:

  • A: Remove the additional refSeries cache built in processWALSamples(). Its comment said it was there to "mitigate lock contention in getByID", but those locks are sharded, so with any realistic number of workers there is no contention.
  • B: Cache the highest timestamp of any mmapped chunk for a series in the main memSeries struct, so we avoid re-looking it up for every sample.
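A minimal sketch of change B, using a hypothetical, heavily simplified memSeries (not the real tsdb struct): caching the max mmapped-chunk timestamp lets WAL replay reject already-persisted samples with a single comparison instead of re-scanning the chunk list.

```go
package main

import "fmt"

// Hypothetical, simplified sketch of change B (not the real tsdb struct):
// memSeries caches the highest timestamp across its mmapped chunks so
// WAL replay can reject already-persisted samples with one comparison.
type memSeries struct {
	mmMaxTime int64 // cached max timestamp of any mmapped chunk
}

// appendable reports whether a WAL sample at timestamp t is newer than
// everything already persisted in mmapped chunks.
func (s *memSeries) appendable(t int64) bool {
	return t > s.mmMaxTime
}

func main() {
	s := &memSeries{mmMaxTime: 1000}
	fmt.Println(s.appendable(999))  // false: covered by an mmapped chunk
	fmt.Println(s.appendable(1001)) // true: worth appending
}
```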

I spent a lot of time exploring different options, and concluded this is a good combination.
Benchmarking them both separately:

Just A: memory is reduced when we have a lot of series, but it runs a little slower.

name                                                                                                    old time/op    new time/op    delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8         345ms ± 2%     345ms ± 1%     ~     (p=0.730 n=5+4)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         523ms ± 2%     549ms ± 3%   +5.03%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8         214ms ±16%     208ms ± 0%     ~     (p=0.730 n=5+4)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     2.19s ± 1%     2.22s ± 1%   +1.54%  (p=0.008 n=5+5)

name                                                                                                    old alloc/op   new alloc/op   delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8        45.7MB ± 0%    45.6MB ± 0%     ~     (p=0.056 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         239MB ± 1%     241MB ± 1%     ~     (p=0.690 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8        50.9MB ± 2%    52.2MB ± 1%   +2.50%  (p=0.032 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     333MB ± 1%     302MB ± 1%   -9.11%  (p=0.008 n=5+5)

name                                                                                                    old allocs/op  new allocs/op  delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8          545k ± 0%      545k ± 0%     ~     (p=1.000 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         2.49M ± 0%     2.49M ± 0%   -0.10%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8          477k ± 0%      486k ± 1%   +1.90%  (p=0.008 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     3.73M ± 0%     3.71M ± 0%   -0.48%  (p=0.008 n=5+5)

Just B: a bit faster, slightly higher memory use.

name                                                                                                    old time/op    new time/op    delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8         345ms ± 2%     337ms ± 1%   -2.45%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         523ms ± 2%     529ms ± 2%     ~     (p=0.222 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8         214ms ±16%     193ms ± 1%   -9.90%  (p=0.016 n=5+4)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     2.19s ± 1%     2.11s ± 1%   -3.57%  (p=0.008 n=5+5)

name                                                                                                    old alloc/op   new alloc/op   delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8        45.7MB ± 0%    45.7MB ± 0%     ~     (p=0.690 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         239MB ± 1%     247MB ± 2%   +3.32%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8        50.9MB ± 2%    51.7MB ± 3%     ~     (p=0.421 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     333MB ± 1%     333MB ± 1%     ~     (p=0.690 n=5+5)

name                                                                                                    old allocs/op  new allocs/op  delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8          545k ± 0%      545k ± 0%     ~     (p=1.000 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         2.49M ± 0%     2.49M ± 0%   +0.06%  (p=0.016 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8          477k ± 0%      478k ± 0%     ~     (p=0.310 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     3.73M ± 0%     3.73M ± 0%     ~     (p=0.690 n=5+5)

A+B: faster and lower memory.

name                                                                                                    old time/op    new time/op    delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8         345ms ± 2%     344ms ± 1%     ~     (p=0.690 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         523ms ± 2%     546ms ± 2%   +4.42%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8         214ms ±16%     206ms ± 1%     ~     (p=0.690 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     2.19s ± 1%     2.11s ± 1%   -3.41%  (p=0.008 n=5+5)

name                                                                                                    old alloc/op   new alloc/op   delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8        45.7MB ± 0%    45.6MB ± 0%   -0.21%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         239MB ± 1%     241MB ± 1%     ~     (p=0.421 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8        50.9MB ± 2%    52.2MB ± 4%     ~     (p=0.222 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     333MB ± 1%     300MB ± 0%   -9.81%  (p=0.008 n=5+5)

name                                                                                                    old allocs/op  new allocs/op  delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8          545k ± 0%      545k ± 0%     ~     (p=0.690 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         2.49M ± 0%     2.49M ± 0%   -0.10%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8          477k ± 0%      485k ± 1%   +1.57%  (p=0.008 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     3.73M ± 0%     3.71M ± 0%   -0.55%  (p=0.008 n=5+5)

However the benefit is much clearer when we load a real 5GB WAL - it needs 19% less CPU and 24% less RAM:

/bin/time before: 172.56user 3.05system 0:27.96elapsed 628%CPU (0avgtext+0avgdata 3840124maxresident)k
           after: 139.75user 2.27system 0:24.07elapsed 590%CPU (0avgtext+0avgdata 2914504maxresident)k

name            old time/op    new time/op    delta
LoadFileWAL-8        27.5s ± 2%     23.7s ± 1%  -13.62%  (p=0.008 n=5+5)

name            old alloc/op   new alloc/op   delta
LoadFileWAL-8       5.05GB ± 0%    3.18GB ± 0%  -37.05%  (p=0.008 n=5+5)

name            old allocs/op  new allocs/op  delta
LoadFileWAL-8        50.5M ± 0%     49.1M ± 0%   -2.82%  (p=0.008 n=5+5)

All benchmarks run on:

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/tsdb
cpu: Intel(R) Xeon(R) CPU E3-1240 v6 @ 3.70GHz

(It is disappointing that we end up carrying the max timestamp around for the life of the program even though we only use it at startup, but I couldn't see an easy way around that.)


@codesome codesome left a comment


LGTM, but you need a rebase for a couple of things :)

  • head.go was split into multiple files, so processWALSamples and loadWAL are now in head_wal.go
  • The place in loadWAL where you are caching mmMaxTime was recently updated, so you will now need to cache it at 2 different places.

@bboreham
Member Author

bboreham commented Aug 5, 2021

I rebased and moved the changes to match #9147, then added two commits for @codesome's suggestions.


@codesome codesome left a comment


LGTM! Since you have a CI failure to fix, I have left a few more nits for the comments to go along with it :)

@bboreham
Member Author

bboreham commented Aug 5, 2021

It's failing a WAL corruption unit test now, which I shall attempt to debug, but if the answer is instantly apparent to someone more familiar with tsdb, please shout.

@codesome
Member

codesome commented Aug 6, 2021

If you don't get to it by next week, I will investigate the failure on Tuesday/Wednesday next week.

@bboreham
Member Author

bboreham commented Aug 7, 2021

The failure was due to skipping mint/maxt in processWALSamples().
I changed things around so the head's min and max times reflect all mmapped chunks added up to that point.
This likely does more atomic operations, but I think that is justified by the flow being easier to follow.
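The atomic min/max update described above can be sketched with compare-and-swap loops (a simplified illustration, not the actual head code; the function and variable names here are made up): each mmapped chunk seen during replay folds its bounds into the head's time range.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// updateMinMaxTime widens the (minTime, maxTime) range to cover
// [mint, maxt] using CAS loops, so concurrent replay workers can all
// fold chunk bounds in safely. Narrower bounds leave the range alone.
func updateMinMaxTime(minTime, maxTime *int64, mint, maxt int64) {
	for {
		lo := atomic.LoadInt64(minTime)
		if mint >= lo || atomic.CompareAndSwapInt64(minTime, lo, mint) {
			break
		}
	}
	for {
		hi := atomic.LoadInt64(maxTime)
		if maxt <= hi || atomic.CompareAndSwapInt64(maxTime, hi, maxt) {
			break
		}
	}
}

func main() {
	minT, maxT := int64(500), int64(600)
	updateMinMaxTime(&minT, &maxT, 100, 900) // bounds of one mmapped chunk
	fmt.Println(minT, maxT)                  // → 100 900
}
```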

Incidentally I question why memSeries.maxTime() does not look at mmapped chunks.
Also why memSeries.minTime() returns MinInt64 as a fallback - elsewhere the fallback for min time is MaxInt64.

@codesome
Member

I question why memSeries.maxTime() does not look at mmapped chunks.

There will always be a head chunk with at least 1 sample (we m-map a chunk when we are about to add a sample to a newly created head chunk). But that is only when things are running normally; if there was WAL corruption and no head chunk exists, then we should actually look at m-mapped chunks too :) so, something to fix here.

Also why memSeries.minTime() returns MinInt64 as a fallback - elsewhere the fallback for min time is MaxInt64.

Not sure, need to check the usage


@codesome codesome left a comment


LGTM. But there is a conflict again :(

So that goroutines are stopped and resources released

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Series are distributed across workers by taking the modulus of the
ID with the number of workers, so multiples of 100 are a poor choice.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Real Prometheus cuts chunks every 120 samples, then skips those samples
when re-reading the WAL. Simulate this by creating a single mapped chunk
for each series, since the max time is all the reader looks at.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
The locks that the map was commented to mitigate contention on are now
sharded 32,000 ways, so they won't be contended. Removing the map saves
memory and goes just as fast.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
So we can skip calling append() for samples it will reject.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Including refactor to extract function `setMMappedChunks`, to reduce
code duplication.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
This ensures we have the correct values if no WAL samples are added for
that series.

Note that `mSeries.maxTime()` was always `math.MinInt64` before, since
that function doesn't consider mmapped chunks.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
@bboreham
Member Author

Rebased.

@codesome codesome merged commit 040ef17 into prometheus:main Aug 10, 2021
codesome added a commit that referenced this pull request Aug 11, 2021
* Fix `kuma_sd` targetgroup reporting (#9157)

* Bundle all xDS targets into a single group

Signed-off-by: austin ce <austin.cawley@gmail.com>

* Snapshot in-memory chunks on shutdown for faster restarts (#7229)

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Rename links

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Remove Individual Data Type Caps in Per-shard Buffering for Remote Write (#8921)

* Moved everything to nPending buffer

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Simplify exemplar capacity addition

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Added pre-allocation

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Don't allocate if not sending exemplars

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Avoid deadlock when processing duplicate series record (#9170)

* Avoid deadlock when processing duplicate series record

`processWALSamples()` needs to be able to send on its output channel
before it can read the input channel, so the caller also reads from the
output channel while sending, in case the output channel is full.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
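The deadlock-avoidance pattern described in that commit can be sketched as follows (a simplified illustration with made-up names, not the real loadWAL code): while handing work to a worker, the producer also drains the worker's output channel, so the worker can never be stuck with a full output and an unread input.

```go
package main

import "fmt"

// send hands v to a worker via input, but uses select to also drain the
// worker's output channel; without this, a worker blocked on a full
// output channel and a producer blocked on a full input channel would
// deadlock each other.
func send(input chan<- int, output <-chan int, v int, recycle func(int)) {
	for {
		select {
		case input <- v:
			return
		case done := <-output:
			recycle(done) // make room so the worker can progress
		}
	}
}

func main() {
	input := make(chan int)     // unbuffered: worker must be ready
	output := make(chan int, 1) // small buffer that can fill up
	go func() {
		for v := range input {
			output <- v * 2 // worker blocks here if output is full
		}
		close(output)
	}()
	var recycled []int
	for i := 1; i <= 4; i++ {
		send(input, output, i, func(d int) { recycled = append(recycled, d) })
	}
	close(input)
	for d := range output { // drain whatever is left
		recycled = append(recycled, d)
	}
	fmt.Println(recycled) // → [2 4 6 8]
}
```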

* processWALSamples: update comment

Previous text seems to relate to an earlier implementation.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Optimise WAL loading by removing extra map and caching min-time (#9160)

* BenchmarkLoadWAL: close WAL after use

So that goroutines are stopped and resources released

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* BenchmarkLoadWAL: make series IDs co-prime with #workers

Series are distributed across workers by taking the modulus of the
ID with the number of workers, so multiples of 100 are a poor choice.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* BenchmarkLoadWAL: simulate mmapped chunks

Real Prometheus cuts chunks every 120 samples, then skips those samples
when re-reading the WAL. Simulate this by creating a single mapped chunk
for each series, since the max time is all the reader looks at.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Fix comment

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Remove series map from processWALSamples()

The locks that the map was commented to mitigate contention on are now
sharded 32,000 ways, so they won't be contended. Removing the map saves
memory and goes just as fast.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* loadWAL: Cache the last mmapped chunk time

So we can skip calling append() for samples it will reject.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Improvements from code review

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Full stops and capitals on comments

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Cache max time in both places mmappedChunks is updated

Including refactor to extract function `setMMappedChunks`, to reduce
code duplication.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Update head min/max time when mmapped chunks added

This ensures we have the correct values if no WAL samples are added for
that series.

Note that `mSeries.maxTime()` was always `math.MinInt64` before, since
that function doesn't consider mmapped chunks.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Split Go and React Tests (#8897)

* Added go-ci and react-ci

Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Remove search keymap from new expression editor (#9184)

Signed-off-by: Julius Volz <julius.volz@gmail.com>

Co-authored-by: Austin Cawley-Edwards <austin.cawley@gmail.com>
Co-authored-by: Levi Harrison <git@leviharrison.dev>
Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Co-authored-by: Bryan Boreham <bjboreham@gmail.com>
Co-authored-by: Julius Volz <julius.volz@gmail.com>
@pracucci pracucci mentioned this pull request Aug 25, 2021
3 tasks
Development

Successfully merging this pull request may close these issues.

Idea to speed up WAL loading