Optimise WAL loading by caching min-time #8645
bboreham wants to merge 6 commits into prometheus:main from
Conversation
pracucci
left a comment
The change makes sense to me. What does BenchmarkLoadWAL() say (e.g. running with -count=5 and then comparing with benchstat)?
As far as I can see, I need some advice on how to set up a benchmark with both WAL samples and mmapped chunks.
I spent some time drilling into the behaviour of the code and figured out how to create some mmapped chunks in
Probably not.
So that goroutines are stopped and resources released. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Series are distributed across workers by taking the modulus of the ID with the number of workers, so multiples of 100 are a poor choice. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Real Prometheus cuts chunks every 120 samples, then skips those samples when re-reading the WAL. Simulate this by creating a single mapped chunk for each series, since the max time is all the reader looks at. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Profiling shows that computing the last timestamp of mapped chunks inside `append()` takes significant CPU time, so compute and store it once per series. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
I have added a new option to
Headline benchmark numbers say not much has changed; however, if we look at the amount of CPU used during the test with simulated mmapped chunks, we see about a 10% improvement, in line with what the profile showed at #8638.
Timings of loading a real 5GB WAL:
cstyan
left a comment
Can we leave a comment somewhere to the effect that skipping samples we know will be rejected in append is the optimization you're making around lines 468/469? It's not immediately clear that caching minTime is that important without looking at the linked issue.
I started to re-word the comment, then realised the whole basis for this cache is incorrect. "Mitigate lock contention in getByID", it said, but the locks are sharded 32,768 ways, so with any likely number of workers there is no contention. Removing
Interesting.
Replaced by #9160
Fixes #8638
Profiling shows that computing the last timestamp of mapped chunks inside `append()` takes ~10% of CPU time, so compute and store it once per series.
Edit: also fix some issues in the benchmark.