Optimise WAL loading by removing extra map and caching min-time #9160

Merged
codesome merged 10 commits into prometheus:main from bboreham:cache-wal-min-3 on Aug 10, 2021

Conversation


@bboreham bboreham commented Aug 5, 2021

This PR replaces #8645; it includes the first four commits which address benchmark issues at #9101.
Fixes #8638.

Two code changes:

  • A: Remove the additional refSeries cache built in processWALSamples(). Its comment said it was there to "mitigate lock contention in getByID", but those locks are sharded, so with any realistic number of workers there is no contention.
  • B: Cache the highest timestamp of any mmapped chunk for a series in the main memSeries struct, so we avoid re-looking it up for every sample.
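A minimal sketch of change B, using a hypothetical, heavily simplified memSeries (not the real tsdb struct): caching the max mmapped-chunk timestamp lets WAL replay reject already-persisted samples with a single comparison instead of re-scanning the chunk list.

```go
package main

import "fmt"

// Hypothetical, simplified sketch of change B (not the real tsdb struct):
// memSeries caches the highest timestamp across its mmapped chunks so
// WAL replay can reject already-persisted samples with one comparison.
type memSeries struct {
	mmMaxTime int64 // cached max timestamp of any mmapped chunk
}

// appendable reports whether a WAL sample at timestamp t is newer than
// everything already persisted in mmapped chunks.
func (s *memSeries) appendable(t int64) bool {
	return t > s.mmMaxTime
}

func main() {
	s := &memSeries{mmMaxTime: 1000}
	fmt.Println(s.appendable(999))  // false: covered by an mmapped chunk
	fmt.Println(s.appendable(1001)) // true: worth appending
}
```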

I spent a lot of time exploring different options, and concluded this is a good combination.
Benchmarking them both separately:

Just A: memory is reduced when we have a lot of series, but it runs a little slower.

name                                                                                                    old time/op    new time/op    delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8         345ms ± 2%     345ms ± 1%     ~     (p=0.730 n=5+4)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         523ms ± 2%     549ms ± 3%   +5.03%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8         214ms ±16%     208ms ± 0%     ~     (p=0.730 n=5+4)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     2.19s ± 1%     2.22s ± 1%   +1.54%  (p=0.008 n=5+5)

name                                                                                                    old alloc/op   new alloc/op   delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8        45.7MB ± 0%    45.6MB ± 0%     ~     (p=0.056 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         239MB ± 1%     241MB ± 1%     ~     (p=0.690 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8        50.9MB ± 2%    52.2MB ± 1%   +2.50%  (p=0.032 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     333MB ± 1%     302MB ± 1%   -9.11%  (p=0.008 n=5+5)

name                                                                                                    old allocs/op  new allocs/op  delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8          545k ± 0%      545k ± 0%     ~     (p=1.000 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         2.49M ± 0%     2.49M ± 0%   -0.10%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8          477k ± 0%      486k ± 1%   +1.90%  (p=0.008 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     3.73M ± 0%     3.71M ± 0%   -0.48%  (p=0.008 n=5+5)

Just B: a bit faster, slightly higher memory use.

name                                                                                                    old time/op    new time/op    delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8         345ms ± 2%     337ms ± 1%   -2.45%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         523ms ± 2%     529ms ± 2%     ~     (p=0.222 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8         214ms ±16%     193ms ± 1%   -9.90%  (p=0.016 n=5+4)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     2.19s ± 1%     2.11s ± 1%   -3.57%  (p=0.008 n=5+5)

name                                                                                                    old alloc/op   new alloc/op   delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8        45.7MB ± 0%    45.7MB ± 0%     ~     (p=0.690 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         239MB ± 1%     247MB ± 2%   +3.32%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8        50.9MB ± 2%    51.7MB ± 3%     ~     (p=0.421 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     333MB ± 1%     333MB ± 1%     ~     (p=0.690 n=5+5)

name                                                                                                    old allocs/op  new allocs/op  delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8          545k ± 0%      545k ± 0%     ~     (p=1.000 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         2.49M ± 0%     2.49M ± 0%   +0.06%  (p=0.016 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8          477k ± 0%      478k ± 0%     ~     (p=0.310 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     3.73M ± 0%     3.73M ± 0%     ~     (p=0.690 n=5+5)

A+B: faster and lower memory.

name                                                                                                    old time/op    new time/op    delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8         345ms ± 2%     344ms ± 1%     ~     (p=0.690 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         523ms ± 2%     546ms ± 2%   +4.42%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8         214ms ±16%     206ms ± 1%     ~     (p=0.690 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     2.19s ± 1%     2.11s ± 1%   -3.41%  (p=0.008 n=5+5)

name                                                                                                    old alloc/op   new alloc/op   delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8        45.7MB ± 0%    45.6MB ± 0%   -0.21%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         239MB ± 1%     241MB ± 1%     ~     (p=0.421 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8        50.9MB ± 2%    52.2MB ± 4%     ~     (p=0.222 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     333MB ± 1%     300MB ± 0%   -9.81%  (p=0.008 n=5+5)

name                                                                                                    old allocs/op  new allocs/op  delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8          545k ± 0%      545k ± 0%     ~     (p=0.690 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8         2.49M ± 0%     2.49M ± 0%   -0.10%  (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8          477k ± 0%      485k ± 1%   +1.57%  (p=0.008 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8     3.73M ± 0%     3.71M ± 0%   -0.55%  (p=0.008 n=5+5)

However the benefit is much clearer when we load a real 5GB WAL - it needs 19% less CPU and 24% less RAM:

/bin/time before: 172.56user 3.05system 0:27.96elapsed 628%CPU (0avgtext+0avgdata 3840124maxresident)k
           after: 139.75user 2.27system 0:24.07elapsed 590%CPU (0avgtext+0avgdata 2914504maxresident)k

name            old time/op    new time/op    delta
LoadFileWAL-8        27.5s ± 2%     23.7s ± 1%  -13.62%  (p=0.008 n=5+5)

name            old alloc/op   new alloc/op   delta
LoadFileWAL-8       5.05GB ± 0%    3.18GB ± 0%  -37.05%  (p=0.008 n=5+5)

name            old allocs/op  new allocs/op  delta
LoadFileWAL-8        50.5M ± 0%     49.1M ± 0%   -2.82%  (p=0.008 n=5+5)

All benchmarks run on:

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/tsdb
cpu: Intel(R) Xeon(R) CPU E3-1240 v6 @ 3.70GHz

(It is disappointing that we end up carrying the max timestamp around for the life of the program even though we only use it at startup, but I couldn't see an easy way around that.)


@codesome codesome left a comment


LGTM, but you need a rebase for a couple of things :)

  • head.go was split into multiple files, so processWALSamples and loadWAL are now in head_wal.go
  • The place in loadWAL where you are caching mmMaxTime was recently updated, so you will now need to cache it at 2 different places.

@bboreham
Member Author

bboreham commented Aug 5, 2021

I rebased and moved the changes to match #9147, then added two commits for @codesome's suggestions.


@codesome codesome left a comment


LGTM! Since you have a CI failure to fix, I have left a few more nits for the comments to go along with it :)

@bboreham
Member Author

bboreham commented Aug 5, 2021

It's failing a WAL corruption unit test now, which I shall attempt to debug, but if the answer is instantly apparent to someone more familiar with tsdb, please shout.

@codesome
Member

codesome commented Aug 6, 2021

If you don't get to it by next week, I will investigate the failure on Tuesday/Wednesday next week.

@bboreham
Member Author

bboreham commented Aug 7, 2021

The failure was due to skipping mint/maxt in processWALSamples().
I changed things around so the head's min and max times reflect all mmapped chunks added up to that point.
This likely does more atomic operations, but I think that is justified by the flow being easier to follow.
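The atomic min/max update described above can be sketched with compare-and-swap loops (a simplified illustration, not the actual head code; the function and variable names here are made up): each mmapped chunk seen during replay folds its bounds into the head's time range.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// updateMinMaxTime widens the (minTime, maxTime) range to cover
// [mint, maxt] using CAS loops, so concurrent replay workers can all
// fold chunk bounds in safely. Narrower bounds leave the range alone.
func updateMinMaxTime(minTime, maxTime *int64, mint, maxt int64) {
	for {
		lo := atomic.LoadInt64(minTime)
		if mint >= lo || atomic.CompareAndSwapInt64(minTime, lo, mint) {
			break
		}
	}
	for {
		hi := atomic.LoadInt64(maxTime)
		if maxt <= hi || atomic.CompareAndSwapInt64(maxTime, hi, maxt) {
			break
		}
	}
}

func main() {
	minT, maxT := int64(500), int64(600)
	updateMinMaxTime(&minT, &maxT, 100, 900) // bounds of one mmapped chunk
	fmt.Println(minT, maxT)                  // → 100 900
}
```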

Incidentally I question why memSeries.maxTime() does not look at mmapped chunks.
Also why memSeries.minTime() returns MinInt64 as a fallback - elsewhere the fallback for min time is MaxInt64.

@codesome
Member

I question why memSeries.maxTime() does not look at mmapped chunks.

There will always be a head chunk with at least 1 sample (we m-map a chunk when we are about to add a sample to a newly created head chunk). But that is only when things are running normally; if there was WAL corruption and no head chunk exists, then we should actually look at m-mapped chunks too :) so, something to fix here.

Also why memSeries.minTime() returns MinInt64 as a fallback - elsewhere the fallback for min time is MaxInt64.

Not sure, need to check the usage


@codesome codesome left a comment


LGTM. But there is a conflict again :(

So that goroutines are stopped and resources released

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Series are distributed across workers by taking the modulus of the
ID with the number of workers, so multiples of 100 are a poor choice.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Real Prometheus cuts chunks every 120 samples, then skips those samples
when re-reading the WAL. Simulate this by creating a single mapped chunk
for each series, since the max time is all the reader looks at.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
The locks that the map was commented to mitigate contention on are now
sharded 32,000 ways, so they won't be contended. Removing the map saves
memory and goes just as fast.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
So we can skip calling append() for samples it will reject.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Including refactor to extract function `setMMappedChunks`, to reduce
code duplication.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
This ensures we have the correct values if no WAL samples are added for
that series.

Note that `mSeries.maxTime()` was always `math.MinInt64` before, since
that function doesn't consider mmapped chunks.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
@bboreham
Member Author

Rebased.

@codesome codesome merged commit 040ef17 into prometheus:main Aug 10, 2021
codesome added a commit that referenced this pull request Aug 11, 2021
* Fix `kuma_sd` targetgroup reporting (#9157)

* Bundle all xDS targets into a single group

Signed-off-by: austin ce <austin.cawley@gmail.com>

* Snapshot in-memory chunks on shutdown for faster restarts (#7229)

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Rename links

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Remove Individual Data Type Caps in Per-shard Buffering for Remote Write (#8921)

* Moved everything to nPending buffer

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Simplify exemplar capacity addition

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Added pre-allocation

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Don't allocate if not sending exemplars

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Avoid deadlock when processing duplicate series record (#9170)

* Avoid deadlock when processing duplicate series record

`processWALSamples()` needs to be able to send on its output channel
before it can read the input channel, so the caller also reads from the
output channel while sending, in case the output channel is full.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
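The deadlock-avoidance pattern described in that commit can be sketched as follows (a simplified illustration with made-up names, not the real loadWAL code): while handing work to a worker, the producer also drains the worker's output channel, so the worker can never be stuck with a full output and an unread input.

```go
package main

import "fmt"

// send hands v to a worker via input, but uses select to also drain the
// worker's output channel; without this, a worker blocked on a full
// output channel and a producer blocked on a full input channel would
// deadlock each other.
func send(input chan<- int, output <-chan int, v int, recycle func(int)) {
	for {
		select {
		case input <- v:
			return
		case done := <-output:
			recycle(done) // make room so the worker can progress
		}
	}
}

func main() {
	input := make(chan int)     // unbuffered: worker must be ready
	output := make(chan int, 1) // small buffer that can fill up
	go func() {
		for v := range input {
			output <- v * 2 // worker blocks here if output is full
		}
		close(output)
	}()
	var recycled []int
	for i := 1; i <= 4; i++ {
		send(input, output, i, func(d int) { recycled = append(recycled, d) })
	}
	close(input)
	for d := range output { // drain whatever is left
		recycled = append(recycled, d)
	}
	fmt.Println(recycled) // → [2 4 6 8]
}
```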

* processWALSamples: update comment

Previous text seems to relate to an earlier implementation.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Optimise WAL loading by removing extra map and caching min-time (#9160)

* BenchmarkLoadWAL: close WAL after use

So that goroutines are stopped and resources released

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* BenchmarkLoadWAL: make series IDs co-prime with #workers

Series are distributed across workers by taking the modulus of the
ID with the number of workers, so multiples of 100 are a poor choice.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* BenchmarkLoadWAL: simulate mmapped chunks

Real Prometheus cuts chunks every 120 samples, then skips those samples
when re-reading the WAL. Simulate this by creating a single mapped chunk
for each series, since the max time is all the reader looks at.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Fix comment

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Remove series map from processWALSamples()

The locks that the map was commented to mitigate contention on are now
sharded 32,000 ways, so they won't be contended. Removing the map saves
memory and goes just as fast.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* loadWAL: Cache the last mmapped chunk time

So we can skip calling append() for samples it will reject.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Improvements from code review

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Full stops and capitals on comments

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Cache max time in both places mmappedChunks is updated

Including refactor to extract function `setMMappedChunks`, to reduce
code duplication.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Update head min/max time when mmapped chunks added

This ensures we have the correct values if no WAL samples are added for
that series.

Note that `mSeries.maxTime()` was always `math.MinInt64` before, since
that function doesn't consider mmapped chunks.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Split Go and React Tests (#8897)

* Added go-ci and react-ci

Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Remove search keymap from new expression editor (#9184)

Signed-off-by: Julius Volz <julius.volz@gmail.com>

Co-authored-by: Austin Cawley-Edwards <austin.cawley@gmail.com>
Co-authored-by: Levi Harrison <git@leviharrison.dev>
Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Co-authored-by: Bryan Boreham <bjboreham@gmail.com>
Co-authored-by: Julius Volz <julius.volz@gmail.com>
@pracucci pracucci mentioned this pull request Aug 25, 2021
3 tasks
Development

Successfully merging this pull request may close these issues.

Idea to speed up WAL loading