Make samplesPerChunk configurable#12055

Merged
beorn7 merged 5 commits into prometheus:main from leizor:leizor/prometheus/issues/12009
May 10, 2023

Conversation

@leizor
Contributor

@leizor leizor commented Mar 3, 2023

This PR suggests a new value for samplesPerChunk based on some observations/assumptions about the average dataset:

  • >95% of timestamps have uniform periods, resulting in a delta-of-delta (dod) of 0.
  • ~60% of values are the same as the previous value, resulting in an xor of 0.
  • ~40% of values can reuse the same “interesting bits” window as the previous value, so we can use the ‘10’ control bits.
  • The average length of the interesting bits of a value is ~8 bits.
  • 99.9% of the first samples of a chunk used a 2-byte uvarint to encode the ts delta (16 bits).

Therefore, this equation (n = number of samples) for expected size per chunk covers 60% of the cases:

S = (n-1) * (1+1)  // Size of samples after the first sample.
+ 16 + 64          // Size of first sample (2 bytes for ts delta + 64 bits for float val).
+ 64               // 64 bit header w/ timestamp.

= 2n + 142

The other 40% is covered by this:

S = (n-1) * (1 + (2+8))  // Size of samples after first sample (1 bit for ts dod, 2 control 
                         // bits + 8 bit average xor length for value)
+ 16 + 64 + 64

= 11n + 133

Combining the two covers nearly 100% of cases:

// Size after compression as a function of # of samples (n)
Sc = (2n + 142) * .6 + (11n + 133) * .4

= 5.6n + 138.4

This allows us to arrive at an equation to describe the expected compression ratio:

CR = 128n / (5.6n + 138.4)

That means the compression ratio approaches an asymptote of 128 / 5.6 ≈ 22.86. If we target 90% of the maximum expected compression ratio, we aim for 22.86 × 0.9 ≈ 20.57. Solving for n, we arrive at 223 samples per chunk, for an expected chunk size of 1388 bits (174 bytes).

As a point of comparison, the current samples per chunk value yields ~83% of the maximum expected compression ratio with an expected chunk size of 811 bits or 102 bytes.
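The arithmetic above can be double-checked with a small sketch. The 5.6n + 138.4 size model and the 128-bits-per-raw-sample baseline are taken directly from the derivation above; the function names are mine.

```go
package main

import (
	"fmt"
	"math"
)

// compressedBits models the expected compressed chunk size in bits
// for n samples, per the 60/40 blend derived above: 5.6n + 138.4.
func compressedBits(n float64) float64 {
	return 5.6*n + 138.4
}

// compressionRatio compares against 128 raw bits per sample
// (64-bit timestamp + 64-bit float value).
func compressionRatio(n float64) float64 {
	return 128 * n / compressedBits(n)
}

// targetSamples returns the smallest n whose compression ratio reaches
// the given fraction of the asymptotic ratio 128/5.6 ≈ 22.86.
func targetSamples(frac float64) int {
	asymptote := 128.0 / 5.6
	n := 1.0
	for compressionRatio(n) < frac*asymptote {
		n++
	}
	return int(n)
}

func main() {
	n := targetSamples(0.9)
	fmt.Printf("n=%d CR=%.2f chunk=%.0f bits (%.0f bytes)\n",
		n, compressionRatio(float64(n)),
		math.Ceil(compressedBits(float64(n))),
		math.Ceil(compressedBits(float64(n))/8))
	// Prints: n=223 CR=20.58 chunk=1388 bits (174 bytes)
}
```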

@leizor leizor force-pushed the leizor/prometheus/issues/12009 branch from 4b374a8 to 33409d1 on March 6, 2023 21:10
@leizor leizor marked this pull request as ready for review March 6, 2023 21:39
@bboreham
Member

bboreham commented Mar 8, 2023

/prombench main

@prombot
Contributor

prombot commented Mar 8, 2023

@bboreham is not a org member nor a collaborator and cannot execute benchmarks.

@roidelapluie
Member

/prombench main

@prombot
Contributor

prombot commented Mar 8, 2023

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-12055 and main

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@prombot
Contributor

prombot commented Mar 11, 2023

Benchmark tests are running for 3 days! If this is intended ignore this message otherwise you can cancel it by commenting: /prombench cancel

@bboreham
Member

It appears to use slightly more memory:
[memory usage graph]

and slightly less disk space:
[disk usage graph]

There are fewer chunks in memory, as expected:
[chunks-in-memory graph]

@leizor what metric(s) were you looking to show an improvement on?

(Also it seems Prombench is executing no queries. That seems like a crucial omission)

@leizor
Contributor Author

leizor commented Mar 11, 2023

I think saving a bit of disk space in exchange for using slightly more memory is pretty much what I expected.

It's a bummer about the queries though because I suspect that this change might also result in slightly longer query latencies since each chunk is larger (I was hoping it'd not be the case though). Is it something in my PR that's making Prombench not execute any queries?

@prombot
Contributor

prombot commented Mar 13, 2023

Benchmark tests are running for 5 days! If this is intended ignore this message otherwise you can cancel it by commenting: /prombench cancel

@leizor
Contributor Author

leizor commented Mar 13, 2023

/prombench cancel

@prombot
Contributor

prombot commented Mar 13, 2023

@leizor is not a org member nor a collaborator and cannot execute benchmarks.

@leizor leizor marked this pull request as draft March 13, 2023 16:24
@leizor
Contributor Author

leizor commented Mar 13, 2023

Converting this PR to a draft; I'm going to take a stab at forgoing sample count limits in favor of chunk size limits here instead.

@beorn7
Member

beorn7 commented Mar 14, 2023

Any ideas why it uses more RAM? With fewer chunks and a better compression ratio, I would expect less RAM usage. If at all, I would expect more CPU usage and longer query times (because we have to iterate through chunks with more samples).

@roidelapluie
Member

/prombench cancel

@prombot
Contributor

prombot commented Mar 15, 2023

Benchmark cancel is in progress.

@leizor
Contributor Author

leizor commented Mar 15, 2023

Any ideas why it uses more RAM? With fewer chunks and a better compression ratio, I would expect less RAM usage. If at all, I would expect more CPU usage and longer query times (because we have to iterate through chunks with more samples).

Hmm good point. I was thinking it'd use more RAM because it'd be holding more samples in memory at any given time, but I think you're right--the number of samples should actually still be the same, just in fewer chunks.

@leizor
Contributor Author

leizor commented Mar 23, 2023

@beorn7 Actually, I stumbled across this today (from here):

Since Prometheus v2.19.0, we are not storing all the chunks in the memory. As soon as a new chunk is cut, the full chunk is flushed to the disk and memory-mapped from the disk while only storing a reference in the memory.

If a chunk is flushed to disk as soon as a new one is cut (instead of only when the block is done), larger chunks would indeed result in increased memory usage, right?

EDIT: Looking a bit more closely, I think the code matches that behavior (after the head chunk is mmapped, s.headChunk points to a new &memChunk).

@bboreham
Member

Good point.

Small note: if the data is queried it will be brought back into memory. So it’s relevant that Prombench was not running any queries.
(However it’s likely that data collected will be much more than data queried, so this is not a huge effect)

@bwplotka
Member

Side note: If we are considering changing that number - what about making this (e.g. hidden for now) flag?

@beorn7
Member

beorn7 commented Mar 27, 2023

Yeah, sure, if we end up with tweaking the number of samples per chunk, it could be a hidden flag for the time being.

But maybe going down the path of trying to set a target in bytes rather than samples is anyway better. (Which then, of course, might be a (hidden) flag, too.)

Signed-off-by: Justin Lei <justin.lei@grafana.com>
@leizor leizor force-pushed the leizor/prometheus/issues/12009 branch from 33409d1 to fd08473 on April 6, 2023 16:23
Signed-off-by: Justin Lei <justin.lei@grafana.com>
@leizor leizor force-pushed the leizor/prometheus/issues/12009 branch from fd08473 to 73ff91d on April 6, 2023 16:43
@leizor leizor marked this pull request as ready for review April 6, 2023 17:15
@leizor
Contributor Author

leizor commented Apr 6, 2023

Do we want to add the flag in this PR or defer to later one?

@beorn7
Member

beorn7 commented Apr 11, 2023

I would say, let's put it into a (hidden) flag here in this PR (similar to --storage.tsdb.min-block-duration and friends, e.g. --storage.tsdb.max-samples-per-chunk).

With the flags, users can experiment, and based on the experiences, we can then set a new default limit that's generally better (or maybe we go down the path of limiting the number of bytes after all).

But if we hardcode a different limit now, and it then turns out to be detrimental in some cases, it will create needless confusion.

@bwplotka
Member

I would say we need a flag.

@leizor leizor force-pushed the leizor/prometheus/issues/12009 branch from fbacc6d to ab4af48 on April 13, 2023 21:45
leizor added 2 commits April 13, 2023 15:59
Signed-off-by: Justin Lei <justin.lei@grafana.com>
Signed-off-by: Justin Lei <justin.lei@grafana.com>
@leizor leizor force-pushed the leizor/prometheus/issues/12009 branch from ab4af48 to c3e6b85 on April 13, 2023 23:00
@leizor
Contributor Author

leizor commented Apr 13, 2023

Ok, I've added the new flag storage.tsdb.samples-per-chunk and defaulted it to the original value of 120 (I think that's what we landed on).

Member

@beorn7 beorn7 left a comment

I think that works.

@roidelapluie @bwplotka @jesusvazquez any last words? Otherwise I'll merge tomorrow.

Member

@jesusvazquez jesusvazquez left a comment

LGTM, nice job @leizor

I am also uncertain about why memory slightly increased with 220 samples; it would be good to look into that.

Also good call on adding the flag.

@beorn7
Member

beorn7 commented May 10, 2023

Thanks, merging now…

@beorn7 beorn7 merged commit 37fe9b8 into prometheus:main May 10, 2023
@colega
Contributor

colega commented May 24, 2023

I've got two notes about this:

One is regarding the memory usage: I guess one of the reasons is the 8 extra bytes of this setting that we're storing per memSeries.

What was the reason for storing this setting per memSeries? Is it to make sure that the 1/4-chunk estimation works correctly if the setting is changed in the Head? But this doesn't sound like a setting someone would want to change at runtime, and even if they did, the one-time side effect of uneven chunks doesn't seem like an argument for storing that integer in every single series.

The other one is regarding how that strange amount of 220 samples (which isn't a multiple of 15, 30, 60: the usual Prometheus scraping intervals) really works with the chunk size estimation:

prometheus/tsdb/head_append.go

Lines 1371 to 1377 in 37e5249

// If we reach 25% of a chunk's desired sample count, predict an end time
// for this chunk that will try to make samples equally distributed within
// the remaining chunks in the current chunk range.
// At latest it must happen at the timestamp set when the chunk was cut.
if numSamples == s.samplesPerChunk/4 {
	s.nextAt = computeChunkEndTime(c.minTime, c.maxTime, s.nextAt)
}

I was wondering whether the estimation algorithm might issue three 160-sample chunks for a 15s scrape interval (respecting the 220 maximum but making it pointless). But after writing some tests, I see that with the setting at 220 we'll actually issue 240-sample chunks:

diff --git tsdb/head_test.go tsdb/head_test.go
index af3df378e..7b69f11c1 100644
--- tsdb/head_test.go
+++ tsdb/head_test.go
@@ -1318,6 +1318,20 @@ func TestComputeChunkEndTime(t *testing.T) {
                        max:   1000,
                        res:   104,
                },
+               {
+                       // Test 15s scrape interval with 220 samples per chunk.
+                       start: 0,                // Start at 0 to make it easier.
+                       cur:   15e3 * (220 / 4), // We're at 1/4th after receiving 220/4 samples each 15s.
+                       max:   2 * 60 * 60e3,    // 2h
+                       res:   1 * 60 * 60e3,    // 1h, i.e. two 240-sample chunks in the 2h range.
+               },
+               {
+                       // Test 30s scrape interval with 220 samples per chunk.
+                       start: 0,                // Start at 0 to make it easier.
+                       cur:   30e3 * (220 / 4), // We're at 1/4th after receiving 220/4 samples each 30s.
+                       max:   2 * 60 * 60e3,    // 2h
+                       res:   2 * 60 * 60e3,    // 2h, i.e. one 240-sample chunk.
+               },
        }
 
        for _, c := range cases {

Maybe we should change the setting to 240 then? I know this is specific to the default 2h blocks, but I don't think anyone runs 110-minute blocks, and it's strange to see chunks issued at a different size from what was set.
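The behavior in the test cases above can be reproduced with a short sketch. This is my reconstruction, written to be consistent with the two test cases; the real computeChunkEndTime in tsdb/head_append.go may differ in detail.

```go
package main

import "fmt"

// computeChunkEndTimeSketch estimates a chunk end time: at 1/4 of the
// target sample count (time cur within the chunk range [start, max]),
// it spreads the remaining samples evenly over an integer number of
// chunks in the rest of the range. Reconstructed from the tests above,
// not copied from Prometheus.
func computeChunkEndTimeSketch(start, cur, max int64) int64 {
	n := (max - start) / ((cur - start + 1) * 4)
	if n <= 1 {
		return max
	}
	return start + (max-start)/n
}

func main() {
	// 15s scrape interval, 220 samples/chunk target: the chunk is cut
	// at 1h, i.e. 240 samples, matching the first test case.
	fmt.Println(computeChunkEndTimeSketch(0, 15e3*(220/4), 2*60*60e3))
	// 30s scrape interval: a single 240-sample chunk covering the
	// whole 2h range, matching the second test case.
	fmt.Println(computeChunkEndTimeSketch(0, 30e3*(220/4), 2*60*60e3))
}
```

The integer division over the chunk range is why the effective chunk size snaps to divisors of the 2h range (240 samples at 15s or 30s) rather than to the configured 220.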

@beorn7
Member

beorn7 commented May 24, 2023

Very good point about storing the setting per memSeries. That seems unneeded indeed. Even if it were not fully responsible for the memory increase, it is still confusing conceptually (because it suggests that different memSeries might have a different setting). @leizor would you like to create an amendment PR?

Also good thoughts about the exact value of samplesPerChunk and its consequences for the actual chunk size. Note that the final version of this PR does not set the value to 220 but leaves it at 120 with the option of setting it via the command line. So we are good on this front, but users of the new flag should be aware of the interdependencies of the different TSDB flags. (They are all hidden advanced flags, so I'm not too concerned right now that it is hard to set them to something meaningful.)

@leizor
Contributor Author

leizor commented May 24, 2023

@beorn7 Yep, will do!

@leizor leizor deleted the leizor/prometheus/issues/12009 branch May 24, 2023 16:57
@alanprot
Contributor

alanprot commented May 24, 2023

Should we only set the config if it's != 0 here?

Just got a divide-by-zero error when updating Prometheus in Thanos! :D

thanos-io/thanos#6392

headOpts.SamplesPerChunk = opts.SamplesPerChunk

Never mind, just saw #12387.

@leizor
Contributor Author

leizor commented May 24, 2023

@colega

What was the reason to do this setting per memSeries?

It was just a way to pass the information down to where it was needed; we need to know the value of samplesPerChunk in appendPreprocessor of memSeries, but memSeries doesn't have a reference to Head or anything else from which to get that information.

I think I'm going to give memSeries a reference to HeadOptions instead. Unfortunately it'll use the same extra 8 bytes per memSeries but I figure at least it wouldn't imply that the value could be different between memSeries.

EDIT: Actually, I think I'll add samplesPerChunk as a parameter to append and appendPreprocessor and pass it in that way.
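The trade-off being described can be sketched as follows. All type and method names here are illustrative stand-ins, not the actual Prometheus code.

```go
package main

import "fmt"

// Option A: every series carries the setting, costing 8 bytes each
// and implying the value could differ between series.
type memSeriesA struct {
	numSamples      int
	samplesPerChunk int
}

func (s *memSeriesA) chunkFull() bool {
	return s.numSamples >= s.samplesPerChunk
}

// Option B: the series stays lean and the caller passes the setting
// down at call time, as with the appendPreprocessor parameter above.
type memSeriesB struct {
	numSamples int
}

func (s *memSeriesB) chunkFull(samplesPerChunk int) bool {
	return s.numSamples >= samplesPerChunk
}

func main() {
	a := memSeriesA{numSamples: 120, samplesPerChunk: 120}
	b := memSeriesB{numSamples: 120}
	// Both report a full chunk, but B stores nothing per series.
	fmt.Println(a.chunkFull(), b.chunkFull(120))
}
```

Passing the value as a parameter keeps the single source of truth in the head options while avoiding any per-series storage.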

@beorn7 beorn7 changed the title Adjust samplesPerChunk from 120 to 220 Make samplesPerChunk configurable May 31, 2023
@beorn7
Member

beorn7 commented May 31, 2023

Updated the PR title to not cause confusion in the future.

	mmMaxTime int64 // Max time of any mmapped chunk, only used during WAL replay.

	nextAt          int64 // Timestamp at which to cut the next chunk.
	samplesPerChunk int   // Target number of samples per chunk.
Member

We should probably make this uint16 and have validation during init to keep it within that range. Saves 48 bits per series.

Contributor Author

We ended up pulling that out in #12390.

@codesome
Member

codesome commented Jun 9, 2025

I am wondering if setting it to a higher number gets hit by #8582. I see that the logic still exists.
