Make samplesPerChunk configurable#12055

Merged
beorn7 merged 5 commits into prometheus:main from leizor:leizor/prometheus/issues/12009
May 10, 2023

Conversation

@leizor
Contributor

@leizor leizor commented Mar 3, 2023

This PR suggests a new value for samplesPerChunk based on some observations/assumptions about the average dataset:

  • >95% of timestamps have uniform periods, resulting in a delta-of-delta (dod) of 0.
  • ~60% of values are the same as the previous value, resulting in an xor of 0.
  • ~40% of values can reuse the same “interesting bits” window as the previous value, so we can use the ‘10’ control bits.
  • The average length of the interesting bits of a value is ~8 bits.
  • 99.9% of the first samples of a chunk used a 2-byte uvarint to encode the ts delta (16 bits).

Therefore, this equation (n = number of samples) for expected size per chunk covers 60% of the cases:

S = (n-1) * (1+1)  // Size of samples after the first sample.
+ 16 + 64          // Size of first sample (2 bytes for ts delta + 64 bits for float val).
+ 64               // 64 bit header w/ timestamp.

= 2n + 142

The other 40% is covered by this:

S = (n-1) * (1 + (2+8))  // Size of samples after first sample (1 bit for ts dod, 2 control 
                         // bits + 8 bit average xor length for value)
+ 16 + 64 + 64

= 11n + 133

Combining the two covers nearly 100% of cases:

// Size after compression as a function of # of samples (n)
Sc = (2n + 142) * .6 + (11n + 133) * .4

= 5.6n + 138.4

This allows us to arrive at an equation to describe the expected compression ratio:

CR = 128n / (5.6n + 138.4)

That means the compression ratio approaches an asymptote of 128 / 5.6 ≈ 22.86. If we target 90% of the maximum expected compression ratio, we aim for 22.86 × 0.9 ≈ 20.57. Solving for n, we arrive at 223 samples per chunk, for an expected chunk size of 1388 bits (174 bytes).

As a point of comparison, the current samples per chunk value yields ~83% of the maximum expected compression ratio with an expected chunk size of 811 bits or 102 bytes.
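The arithmetic above can be double-checked with a small sketch. The 5.6n + 138.4 size model and the 128-bits-per-raw-sample baseline are taken directly from the derivation above; the function names are mine.

```go
package main

import (
	"fmt"
	"math"
)

// compressedBits models the expected compressed chunk size in bits
// for n samples, per the 60/40 blend derived above: 5.6n + 138.4.
func compressedBits(n float64) float64 {
	return 5.6*n + 138.4
}

// compressionRatio compares against 128 raw bits per sample
// (64-bit timestamp + 64-bit float value).
func compressionRatio(n float64) float64 {
	return 128 * n / compressedBits(n)
}

// targetSamples returns the smallest n whose compression ratio reaches
// the given fraction of the asymptotic ratio 128/5.6 ≈ 22.86.
func targetSamples(frac float64) int {
	asymptote := 128.0 / 5.6
	n := 1.0
	for compressionRatio(n) < frac*asymptote {
		n++
	}
	return int(n)
}

func main() {
	n := targetSamples(0.9)
	fmt.Printf("n=%d CR=%.2f chunk=%.0f bits (%.0f bytes)\n",
		n, compressionRatio(float64(n)),
		math.Ceil(compressedBits(float64(n))),
		math.Ceil(compressedBits(float64(n))/8))
	// Prints: n=223 CR=20.58 chunk=1388 bits (174 bytes)
}
```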

@leizor leizor force-pushed the leizor/prometheus/issues/12009 branch from 4b374a8 to 33409d1 on March 6, 2023 21:10
@leizor leizor marked this pull request as ready for review March 6, 2023 21:39
@bboreham
Member

bboreham commented Mar 8, 2023

/prombench main

@prombot
Contributor

prombot commented Mar 8, 2023

@bboreham is not a org member nor a collaborator and cannot execute benchmarks.

@roidelapluie
Member

/prombench main

@prombot
Contributor

prombot commented Mar 8, 2023

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-12055 and main

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@prombot
Contributor

prombot commented Mar 11, 2023

Benchmark tests are running for 3 days! If this is intended ignore this message otherwise you can cancel it by commenting: /prombench cancel

@bboreham
Member

It appears to use slightly more memory:
[memory usage graph]

and slightly less disk space:
[disk usage graph]

There are fewer chunks in memory, as expected:
[chunks-in-memory graph]

@leizor what metric(s) were you looking to show an improvement on?

(Also it seems Prombench is executing no queries. That seems like a crucial omission)

@leizor
Contributor Author

leizor commented Mar 11, 2023

I think saving a bit of disk space in exchange for using slightly more memory is pretty much what I expected.

It's a bummer about the queries though because I suspect that this change might also result in slightly longer query latencies since each chunk is larger (I was hoping it'd not be the case though). Is it something in my PR that's making Prombench not execute any queries?

@prombot
Contributor

prombot commented Mar 13, 2023

Benchmark tests are running for 5 days! If this is intended ignore this message otherwise you can cancel it by commenting: /prombench cancel

@leizor
Contributor Author

leizor commented Mar 13, 2023

/prombench cancel

@prombot
Contributor

prombot commented Mar 13, 2023

@leizor is not a org member nor a collaborator and cannot execute benchmarks.

@leizor leizor marked this pull request as draft March 13, 2023 16:24
@leizor
Contributor Author

leizor commented Mar 13, 2023

Converting this PR to a draft; I'm going to take a stab at forgoing sample count limits in favor of chunk size limits here instead.

@beorn7
Member

beorn7 commented Mar 14, 2023

Any ideas why it uses more RAM? With fewer chunks and a better compression ratio, I would expect less RAM usage. If at all, I would expect more CPU usage and longer query times (because we have to iterate through chunks with more samples).

@roidelapluie
Member

/prombench cancel

@prombot
Contributor

prombot commented Mar 15, 2023

Benchmark cancel is in progress.

@leizor
Contributor Author

leizor commented Mar 15, 2023

Any ideas why it uses more RAM? With fewer chunks and a better compression ratio, I would expect less RAM usage. If at all, I would expect more CPU usage and longer query times (because we have to iterate through chunks with more samples).

Hmm good point. I was thinking it'd use more RAM because it'd be holding more samples in memory at any given time, but I think you're right--the number of samples should actually still be the same, just in fewer chunks.

@leizor
Contributor Author

leizor commented Mar 23, 2023

@beorn7 Actually, I stumbled across this today (from here):

Since Prometheus v2.19.0, we are not storing all the chunks in the memory. As soon as a new chunk is cut, the full chunk is flushed to the disk and memory-mapped from the disk while only storing a reference in the memory.

If a chunk is flushed to disk as soon as a new one is cut (instead of only when the block is done), larger chunks would indeed result in increased memory usage, right?

EDIT: Looking a bit more closely, I think the code matches that behavior (after the head chunk is mmapped, s.headChunk points to a new &memChunk).

@bboreham
Member

Good point.

Small note: if the data is queried it will be brought back into memory. So it’s relevant that Prombench was not running any queries.
(However it’s likely that data collected will be much more than data queried, so this is not a huge effect)

@bwplotka
Member

Side note: If we are considering changing that number - what about making this (e.g. hidden for now) flag?

@beorn7
Member

beorn7 commented Mar 27, 2023

Yeah, sure, if we end up with tweaking the number of samples per chunk, it could be a hidden flag for the time being.

But maybe going down the path of trying to set a target in bytes rather than samples is anyway better. (Which then, of course, might be a (hidden) flag, too.)

Signed-off-by: Justin Lei <justin.lei@grafana.com>
@leizor leizor force-pushed the leizor/prometheus/issues/12009 branch from 33409d1 to fd08473 on April 6, 2023 16:23
Signed-off-by: Justin Lei <justin.lei@grafana.com>
@leizor leizor force-pushed the leizor/prometheus/issues/12009 branch from fd08473 to 73ff91d on April 6, 2023 16:43
@leizor leizor marked this pull request as ready for review April 6, 2023 17:15
@leizor
Contributor Author

leizor commented Apr 6, 2023

Do we want to add the flag in this PR or defer to later one?

@beorn7
Member

beorn7 commented Apr 11, 2023

I would say, let's put it into a (hidden) flag here in this PR (similar to --storage.tsdb.min-block-duration and friends, e.g. --storage.tsdb.max-samples-per-chunk).

With the flags, users can experiment, and based on the experiences, we can then set a new default limit that's generally better (or maybe we go down the path of limiting the number of bytes after all).

But if we hardcode a different limit now, and it then turns out to be detrimental in some cases, it will create needless confusion.

@bwplotka
Member

I would say we need a flag.

@leizor leizor force-pushed the leizor/prometheus/issues/12009 branch from fbacc6d to ab4af48 on April 13, 2023 21:45
leizor added 2 commits April 13, 2023 15:59
Signed-off-by: Justin Lei <justin.lei@grafana.com>
Signed-off-by: Justin Lei <justin.lei@grafana.com>
@leizor leizor force-pushed the leizor/prometheus/issues/12009 branch from ab4af48 to c3e6b85 on April 13, 2023 23:00
@leizor
Contributor Author

leizor commented Apr 13, 2023

Ok, I've added the new flag storage.tsdb.samples-per-chunk and defaulted it to the original value of 120 (I think that's what we landed on).

Member

@beorn7 beorn7 left a comment

I think that works.

@roidelapluie @bwplotka @jesusvazquez any last words? Otherwise I'll merge tomorrow.

Member

@jesusvazquez jesusvazquez left a comment

LGTM, nice job @leizor

I am also uncertain about why memory slightly increased with 220 samples; it would be good to look into that.

Also good call on adding the flag.

@beorn7
Member

beorn7 commented May 10, 2023

Thanks, merging now…

@beorn7 beorn7 merged commit 37fe9b8 into prometheus:main May 10, 2023
@colega
Contributor

colega commented May 24, 2023

I've got two notes about this:

One is regarding the memory usage: I guess one of the reasons is the 8 extra bytes of this setting that we're storing per memSeries.

What was the reason for storing this setting per memSeries? Is it to make sure that the 1/4-chunk estimation works correctly if the setting is changed in the Head? But this doesn't sound like a setting someone would want to change at runtime, and even if they did, the one-time side effect of uneven chunks doesn't seem like an argument for storing that integer in every single series.

The other one is regarding how that strange amount of 220 samples (which isn't a multiple of 15, 30, 60: the usual Prometheus scraping intervals) really works with the chunk size estimation:

prometheus/tsdb/head_append.go

Lines 1371 to 1377 in 37e5249

// If we reach 25% of a chunk's desired sample count, predict an end time
// for this chunk that will try to make samples equally distributed within
// the remaining chunks in the current chunk range.
// At latest it must happen at the timestamp set when the chunk was cut.
if numSamples == s.samplesPerChunk/4 {
	s.nextAt = computeChunkEndTime(c.minTime, c.maxTime, s.nextAt)
}

I was wondering whether the estimation algorithm might issue three 160-sample chunks for a 15s scrape interval (respecting the 220 maximum but making it pointless). But after writing some tests, I see that with the setting at 220 we'll actually issue 240-sample chunks:

diff --git tsdb/head_test.go tsdb/head_test.go
index af3df378e..7b69f11c1 100644
--- tsdb/head_test.go
+++ tsdb/head_test.go
@@ -1318,6 +1318,20 @@ func TestComputeChunkEndTime(t *testing.T) {
                        max:   1000,
                        res:   104,
                },
+               {
+                       // Test 15s scrape interval with 220 samples per chunk.
+                       start: 0,                // Start at 0 to make it easier.
+                       cur:   15e3 * (220 / 4), // We're at 1/4th after receiving 220/4 samples each 15s.
+                       max:   2 * 60 * 60e3,    // 2h
+                       res:   1 * 60 * 60e3,    // 1h, i.e. two 240-sample chunks in the 2h range.
+               },
+               {
+                       // Test 30s scrape interval with 220 samples per chunk.
+                       start: 0,                // Start at 0 to make it easier.
+                       cur:   30e3 * (220 / 4), // We're at 1/4th after receiving 220/4 samples each 30s.
+                       max:   2 * 60 * 60e3,    // 2h
+                       res:   2 * 60 * 60e3,    // 2h, i.e. one 240-sample chunk.
+               },
        }
 
        for _, c := range cases {

Maybe we should change the setting to 240 then? I know this is specific to the default 2h blocks, but I don't think anyone runs 110-minute blocks, and it's strange to see chunks issued at a different size from what was set.
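The behavior in the test cases above can be reproduced with a short sketch. This is my reconstruction, written to be consistent with the two test cases; the real computeChunkEndTime in tsdb/head_append.go may differ in detail.

```go
package main

import "fmt"

// computeChunkEndTimeSketch estimates a chunk end time: at 1/4 of the
// target sample count (time cur within the chunk range [start, max]),
// it spreads the remaining samples evenly over an integer number of
// chunks in the rest of the range. Reconstructed from the tests above,
// not copied from Prometheus.
func computeChunkEndTimeSketch(start, cur, max int64) int64 {
	n := (max - start) / ((cur - start + 1) * 4)
	if n <= 1 {
		return max
	}
	return start + (max-start)/n
}

func main() {
	// 15s scrape interval, 220 samples/chunk target: the chunk is cut
	// at 1h, i.e. 240 samples, matching the first test case.
	fmt.Println(computeChunkEndTimeSketch(0, 15e3*(220/4), 2*60*60e3))
	// 30s scrape interval: a single 240-sample chunk covering the
	// whole 2h range, matching the second test case.
	fmt.Println(computeChunkEndTimeSketch(0, 30e3*(220/4), 2*60*60e3))
}
```

The integer division over the chunk range is why the effective chunk size snaps to divisors of the 2h range (240 samples at 15s or 30s) rather than to the configured 220.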

@beorn7
Member

beorn7 commented May 24, 2023

Very good point about storing the setting per memSeries. That seems unneeded indeed. Even if it were not fully responsible for the memory increase, it is still confusing conceptually (because it suggests that different memSeries might have a different setting). @leizor would you like to create an amendment PR?

Also good thoughts about the exact value of samplesPerChunk and its consequences for the actual chunk size. Note that the final version of this PR does not set the value to 220 but leaves it at 120 with the option of setting it via the command line. So we are good on this front, but users of the new flag should be aware of the interdependencies of the different TSDB flags. (They are all hidden advanced flags, so I'm not too concerned right now that it is hard to set them to something meaningful.)

@leizor
Contributor Author

leizor commented May 24, 2023

@beorn7 Yep, will do!

@leizor leizor deleted the leizor/prometheus/issues/12009 branch May 24, 2023 16:57
@alanprot
Contributor

alanprot commented May 24, 2023

Should we only set the config if it's != 0 here?

Just got a divide-by-zero error when updating Prometheus in Thanos! :D

thanos-io/thanos#6392

headOpts.SamplesPerChunk = opts.SamplesPerChunk

Never mind, just saw #12387.

@leizor
Contributor Author

leizor commented May 24, 2023

@colega

What was the reason to do this setting per memSeries?

It was just a way to pass the information down to where it was needed; we need to know the value of samplesPerChunk in appendPreprocessor of memSeries, but memSeries doesn't have a reference to Head or anything else from which to get that information.

I think I'm going to give memSeries a reference to HeadOptions instead. Unfortunately it'll use the same extra 8 bytes per memSeries but I figure at least it wouldn't imply that the value could be different between memSeries.

EDIT: Actually, I think I'll add samplesPerChunk as a parameter to append and appendPreprocessor and pass it in that way.
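The trade-off being described can be sketched as follows. All type and method names here are illustrative stand-ins, not the actual Prometheus code.

```go
package main

import "fmt"

// Option A: every series carries the setting, costing 8 bytes each
// and implying the value could differ between series.
type memSeriesA struct {
	numSamples      int
	samplesPerChunk int
}

func (s *memSeriesA) chunkFull() bool {
	return s.numSamples >= s.samplesPerChunk
}

// Option B: the series stays lean and the caller passes the setting
// down at call time, as with the appendPreprocessor parameter above.
type memSeriesB struct {
	numSamples int
}

func (s *memSeriesB) chunkFull(samplesPerChunk int) bool {
	return s.numSamples >= samplesPerChunk
}

func main() {
	a := memSeriesA{numSamples: 120, samplesPerChunk: 120}
	b := memSeriesB{numSamples: 120}
	// Both report a full chunk, but B stores nothing per series.
	fmt.Println(a.chunkFull(), b.chunkFull(120))
}
```

Passing the value as a parameter keeps the single source of truth in the head options while avoiding any per-series storage.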

@beorn7 beorn7 changed the title Adjust samplesPerChunk from 120 to 220 Make samplesPerChunk configurable May 31, 2023
@beorn7
Member

beorn7 commented May 31, 2023

Updated the PR title to not cause confusion in the future.

	mmMaxTime int64 // Max time of any mmapped chunk, only used during WAL replay.

	nextAt          int64 // Timestamp at which to cut the next chunk.
	samplesPerChunk int   // Target number of samples per chunk.
Member

We should probably make this uint16 and have validation during init to keep it within that range. Saves 48 bits per series.

Contributor Author

We ended up pulling that out in #12390.

@codesome
Member

codesome commented Jun 9, 2025

I am wondering if setting it to a higher number gets hit by #8582. I see that the logic still exists.
