
Add a chunk size limit in bytes #12054

Merged
beorn7 merged 12 commits into prometheus:main from leizor:leizor/prometheus/issues/11219
Aug 24, 2023

Conversation

@leizor (Contributor) commented Mar 3, 2023

#11219
#12009

This PR adds a mechanism to cap the size of each chunk to a soft limit depending on the type of chunk; XOR chunks are capped to 1024 bytes and (float) histogram chunks are capped to 1024 KiB.

EDIT: I'm sticking with the cap mechanism for XOR chunks, but moving to a target bytes per chunk for native histogram chunks (more details here).
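The cut decision for the soft cap can be sketched as follows. This is purely illustrative: the encoding type, constant names, and helper function are not the upstream Prometheus API, only the byte values come from the description above.

```go
package main

import "fmt"

// Encoding distinguishes the chunk types discussed above. Illustrative only.
type Encoding int

const (
	EncXOR Encoding = iota
	EncHistogram
	EncFloatHistogram
)

// Soft byte caps per encoding, using the values from the PR description:
// 1024 bytes for XOR chunks, 1024 KiB for (float) histogram chunks.
const (
	maxBytesPerXORChunk       = 1024
	maxBytesPerHistogramChunk = 1024 * 1024
)

// shouldCutNewChunk reports whether an append should start a new chunk
// because the current one has grown past its soft cap for its encoding.
func shouldCutNewChunk(enc Encoding, curBytes int) bool {
	switch enc {
	case EncXOR:
		return curBytes > maxBytesPerXORChunk
	case EncHistogram, EncFloatHistogram:
		return curBytes > maxBytesPerHistogramChunk
	}
	return false
}

func main() {
	fmt.Println(shouldCutNewChunk(EncXOR, 1025))       // just past the 1 KiB cap
	fmt.Println(shouldCutNewChunk(EncHistogram, 1025)) // well under 1 MiB
}
```

Note the cap is "soft": the check runs before an append, so a single large sample can still push a chunk past the limit.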


```diff
 for j := 0; j < numSpans; j++ {
-	s := histogram.Span{Offset: 1 + int32(i), Length: spanLength}
+	s := histogram.Span{Offset: 1, Length: spanLength}
```
@leizor (Contributor, Author):

This change prevents a new chunk from being cut on every append, and I don't think it affects the benchmark that uses this method but let me know if I'm wrong about that!

@beorn7 (Member):

Yeah, that was weird. Maybe it was a mistake from the beginning and @LeviHarrison meant to write

```go
s := histogram.Span{Offset: 1 + int32(j), Length: spanLength}
```

(`j`, not `i`)?

@LeviHarrison (Contributor):

It's been a while since I wrote that, but from what I can remember, yes: `j` makes more sense and I must have made a mistake using `i`.

@leizor leizor marked this pull request as ready for review March 3, 2023 21:05
@leizor leizor requested a review from codesome as a code owner March 3, 2023 21:05
@beorn7 (Member) left a comment:

I think the soft limit is fine (it definitely is for Prometheus, though I have heard thoughts that a hard chunk size limit would greatly help Mimir). I also believe that a hard limit is impossible for histograms anyway, as a single histogram can have an arbitrary number of buckets and could therefore exceed any given hard limit with a single sample.

As mentioned in an individual comment below: This PR implements a size limit in addition to the existing sample count limit (which is also a soft limit, interestingly). I'm not sure if that was intended. My current thought is that a size limit is superior to a sample count limit, and we should simply switch over to a size limit entirely. But it would be cool if we could base that decision on some benchmarking with real-world data.

For histograms, the research is even a bit more complex. Maybe for smaller histograms, the ideal trade-off for the chunk size is different than for larger histograms…

In any case, I think we also need the extrapolation heuristic with a size limit, see comments.

Maybe @codesome also has some thoughts. Might also help to share our mutual thoughts more directly in a chat or call…

Comment on lines +68 to +69:

```go
MaxBytesPerHistogramChunk      = 1024 * 1024
MaxBytesPerFloatHistogramChunk = 1024 * 1024
```
@beorn7 (Member):

Did you just guess these sizes or did you derive them from some real-world observations?

My gut feeling is that 1MiB for a chunk is quite big. I would broadly estimate that a typical histogram sample is about 10x larger than a float sample. Which (even if I'm guessing right) doesn't necessarily imply that 10240 bytes is a good size. I think it really depends on how things work out best in practice. Primarily decoding time vs. space savings, but there might also be steps in the performance curve if we hit certain size limits (like cache lines etc.).

Maybe even 1024 bytes is a good value for histogram chunks. It's still about 100 samples if a typical histogram takes 10 bytes… (but does it? we should look at histograms as scraped from real-world systems.)


```diff
-func generateBigTestHistograms(n int) []*histogram.Histogram {
-	const numBuckets = 500
+func generateBigTestHistograms(n, bktMultFactor int) []*histogram.Histogram {
```
@beorn7 (Member):

`bktMultFactor` is a cryptic name. Let's perhaps use the number of buckets directly, and rename the function `generateTestHistograms(numHistograms, numBuckets int)`.


Comment on lines +1349 to +1355:

```go
} else {
	maxBytesPerChunk := chunkenc.MaxBytesPerChunk(e)
	if maxBytesPerChunk > 0 && len(c.chunk.Bytes()) > maxBytesPerChunk {
		c = s.cutNewHeadChunk(t, e, chunkDiskMapper, chunkRange)
		chunkCreated = true
	}
```
@beorn7 (Member):

Maybe this test should come after the out-of-order check below? (@codesome might know better)

A more fundamental thought: This adds the check for bytes to the check for sample count. Maybe that's intended for this first iteration. But it will probably only kick in rarely then because our current chunks with about 120 samples per chunk are (I guess) much smaller than 1024 bytes in most cases. Even histogram chunks with moderately sized histograms might not reach the 1024 bytes very often (and they might almost never reach the 1MiB proposed above).

But if the byte limit kicks in, the whole heuristic below for creating evenly sized chunks within a chunk range is circumvented. The code below prevents a chunk with 120 samples from being followed by a chunk with 10 samples, by instead creating two chunks with 65 samples each. I think we need the same for the size limit, to avoid a 1024B chunk followed by a 10B chunk and instead create two chunks of 517B each.

And orthogonal to the above: I would currently think we should switch from a sample limit to a size limit altogether and not have them both in parallel.
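The even-sizing idea works the same way for samples and bytes: spread the expected total over the minimum number of chunks that respects the limit, instead of cutting at the limit and leaving a tiny remainder. A minimal sketch (the function name is illustrative, not upstream code):

```go
package main

import "fmt"

// evenChunkSize spreads an expected total (sample count or byte count)
// evenly over the fewest chunks that each stay within the limit, avoiding
// a full chunk followed by a near-empty remainder chunk.
func evenChunkSize(expectedTotal, limit int) int {
	if expectedTotal <= limit {
		return expectedTotal
	}
	numChunks := (expectedTotal + limit - 1) / limit // ceil(total / limit)
	return (expectedTotal + numChunks - 1) / numChunks
}

func main() {
	// 130 samples with a 120-sample limit: two chunks of 65, not 120 + 10.
	fmt.Println(evenChunkSize(130, 120)) // 65
	// 1034 bytes with a 1024-byte limit: two chunks of 517, not 1024 + 10.
	fmt.Println(evenChunkSize(1034, 1024)) // 517
}
```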

@leizor (Contributor, Author) commented Mar 13, 2023

> As mentioned in an individual comment below: This PR implements a size limit in addition to the existing sample count limit (which is also a soft limit, interestingly). I'm not sure if that was intended. My current thought is that a size limit is superior to a sample count limit, and we should simply switch over to a size limit entirely.

Ah, that makes sense. I initially thought of the size limit as a last-resort mechanism to cap chunk sizes, but I see your thinking. I'll rework this PR with a full switch away from the sample count limit in mind.

> But it would be cool if we could base that decision on some benchmarking with real-world data.

I think I have a decent idea of where to start on this sort of decision-making for float samples, but I suspect I'll need some help for histograms; I'll probably reach out when I get to that point. 🙂

@beorn7 (Member) commented Mar 14, 2023

I think #12055 is an easy thing to do and should totally get merged if the more sophisticated approach here will take its time. (However, as commented in #12055, I'm actually surprised that the RAM usage seems to have increased slightly. CPU or query time improvements would be expected, but why RAM? It kind of hints towards the need for a better understanding of what is going on.)

@leizor leizor force-pushed the leizor/prometheus/issues/11219 branch from 047005f to c93a5e0 Compare March 26, 2023 18:43
@leizor leizor marked this pull request as draft March 26, 2023 18:45
@leizor (Contributor, Author) commented Mar 26, 2023

I made the shift to target bytes per chunk instead of samples per chunk. I don't have a good value for target bytes for histogram chunks yet but I figured it'd be interesting to see how this change benchmarks for XOR chunks.

@beorn7 (Member) commented Mar 29, 2023

Do you want more review comments now or are you still working on this? (Asking because the PR is marked as draft.)

@leizor (Contributor, Author) commented Mar 29, 2023

Hey @beorn7! Thanks for checking in.

I had a brief chat with @codesome yesterday about this, and I think we landed on keeping the original samplesPerChunk + soft cap arrangement for XOR chunks and using this new size limit mechanism only for native histogram chunks. I was going to start exploring that a bit, but I know it doesn't align with your suggestion of switching over to a size limit entirely.

I'm a bit at a loss regarding where this PR will ultimately end up; maybe it merits further discussion?

@beorn7 (Member) commented Mar 30, 2023

I don't have a strong opinion here. All my ideas expressed so far are based on educated guessing. For the real decision, we should look at benchmarks.

If there is no pressure, we can leave everything as-is for XOR chunks. (But my impression was there is some appetite to both put more samples into a chunk and also (soft-)limit the size of chunks.)

For histograms, we need to act in any case as 120 large histograms might create very large chunks. But even there, I'm not sure that a simple size limit will be ideal. For smaller histograms, it might be better to create somewhat smaller chunks (in byte) because we don't want to pay the overhead of larger chunks if the compression has already reached diminishing returns. On the other hand, for very large histograms, it might still be better to put a certain number of those into a chunk (let's say tens) to get any compression at all, even though the chunk gets quite large then.

So yeah, it needs some research.

In any case, I'll remove this PR from my review queue until there is news on which direction the work will go.

@codesome (Member):

> But my impression was there is some appetite to both put more samples into a chunk and also (soft-)limit the size of chunks.

Yes, but I would like to do it one at a time. Adding size limit first in this PR (for both chunks), and then looking at increasing sample count limit for XOR chunks next in #12055.

Given that we try to have roughly equally sized chunks in a block (the code), we might want to revisit that logic as well, along with increasing the sample count limit in #12055.

@leizor leizor force-pushed the leizor/prometheus/issues/11219 branch 6 times, most recently from 4443804 to bd70ee2 Compare April 5, 2023 00:03
@leizor leizor marked this pull request as ready for review April 5, 2023 00:44
@leizor (Contributor, Author) commented Apr 5, 2023

@codesome @beorn7

I think this PR is ready for some 👀 again.

I went the route of sticking with samplesPerChunk + a soft cap for XOR chunks (and planning to revisit #12055 eventually as well) but moving to a target size per chunk for native histogram chunks. Native histogram chunks go through a similar end-time estimation process as XOR chunks, but I was worried that some histograms may be large enough to significantly alter the "1/4 full" assumption of computeChunkEndTime, so I added a ratioToFull parameter to that function. That function also works with floats instead of ints now, but I'm hoping that's not a big deal because it only happens once per chunk.
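The end-time extrapolation with a fill-ratio parameter can be sketched as follows. This is an illustration of the heuristic described above, under the assumption that the signature gains a ratioToFull parameter (the classic XOR assumption is 0.25, i.e. "1/4 full"); the merged code may differ in detail.

```go
package main

import (
	"fmt"
	"math"
)

// computeChunkEndTime estimates when the current head chunk should end so
// that chunks come out evenly sized within the chunk range [start, max).
// cur is the latest appended sample's timestamp, and ratioToFull is how
// full the chunk is assumed to be at that point.
func computeChunkEndTime(start, cur, max int64, ratioToFull float64) int64 {
	// Projected duration of a full chunk at the observed append rate:
	// elapsed time scaled up by 1/ratioToFull.
	fullDuration := float64(cur-start+1) / ratioToFull
	// How many chunks of that duration fit into the remaining range.
	n := math.Floor(float64(max-start) / fullDuration)
	if n <= 1 {
		// Fewer than two would fit; just run to the range boundary.
		return max
	}
	// Divide the range into n evenly sized chunks and end at the first cut.
	return start + int64(float64(max-start)/n)
}

func main() {
	// Chunk range [0, 400), 25ms elapsed at 1/4 full: four even chunks of 100.
	fmt.Println(computeChunkEndTime(0, 24, 400, 0.25)) // 100
}
```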

I've chosen 1024 as the initial value for TargetBytesPerNativeHistogramChunk. Looking at ~26k chunks of request duration data, it at least doesn't immediately look like an unreasonable number to begin with:

(Screenshot: distribution of bytes per sample across the ~26k chunks, taken Apr 4, 2023.)

  • 34% of samples are under 2 bytes (512 samples per chunk if each sample is 2 bytes)
  • 52% of samples are under 5 bytes (204 samples per chunk if each sample is 5 bytes)
  • 93% of samples are under 25 bytes (40 samples per chunk if each sample is 25 bytes)
  • There's a bump of samples in the 45-60 byte range (17 samples per chunk if each sample is 60 bytes)
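The bullet-point arithmetic above is just integer division of the target by the average sample size; a tiny sketch for checking other sample sizes against the 1024-byte target:

```go
package main

import "fmt"

// samplesPerChunk reproduces the arithmetic above: how many samples fit
// into the target chunk size at a given average bytes-per-sample.
func samplesPerChunk(targetBytes, bytesPerSample int) int {
	return targetBytes / bytesPerSample
}

func main() {
	for _, b := range []int{2, 5, 25, 60} {
		fmt.Printf("%d bytes/sample -> %d samples per 1024-byte chunk\n",
			b, samplesPerChunk(1024, b))
	}
}
```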

@leizor (Contributor, Author) commented Apr 5, 2023

As a side note: after updating a bunch of tests both ways, another reason I've chosen to stick with samples per chunk for XOR chunks, instead of also moving them to a target bytes per chunk, is that samples per chunk is overall easier to reason about than bytes per chunk, so using it where it still makes sense seemed like a win.

```go
	s.nativeHistogramChunkHasComputedEndTime = true
}
// If numBytes > targetBytes*2 then our previous prediction was invalid,
// most likely because samples rate has changed and now they are arriving more frequently.
```
@beorn7 (Member):

Or the bucket count has increased?

@leizor (Contributor, Author):

Ah, yes, good point; there's a wider range of valid reasons for invalid predictions. I'll make a note.

@beorn7 (Member) commented Apr 11, 2023

OK, so let me reiterate what I have understood:

  • This PR introduces a size limit for histogram chunks with the same extrapolation logic as previously applied for the sample count limit for XOR chunks.
  • It also introduces an additional size limit for XOR chunks, but without affecting the extrapolation logic, i.e. this is just a safety cap (and it is usually not hit, because the vast majority of XOR chunks with 120 samples are smaller than 1024 bytes).

Three high-level thoughts:

  1. Since we know the worst case sample size for XOR chunks (19 bytes), why not set the "soft" limit to 1005 bytes, resulting in a hard limit of 1024 bytes? Still, the majority of chunks won't be affected by this, but I guess it can be quite helpful for distributed storage implementations to know that a chunk is only over 1kiB at most.
  2. If we start to experiment with higher numbers of samples per XOR chunk, the above rationale about the 1024B limit being rarely hit might not hold any longer. (We don't have to do anything about it in this PR, but we have to keep it in mind for Make samplesPerChunk configurable #12055 etc.)
  3. The data about bytes per histogram is awesome. It's great to see real-world results. However, that's just one setup, and even if the vast majority of use cases see a similar distribution of sizes, I'm still concerned about the rare case where a histogram simply has a lot of buckets. In that case we would create chunks with very few samples, killing off any compression advantage just when we need it most. How about also setting a minimum number of samples per chunk? I guess something around 10 would still give us ~90% of the compression potential while preventing giant chunks.
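Point 3 can be sketched as a cut predicate that honors the byte target but never cuts before a minimum sample count is reached, so a few very large histograms can't produce near-singleton chunks with no compression to speak of. Function and parameter names are illustrative; the minimum of 10 is the guess from the comment above.

```go
package main

import "fmt"

// shouldCut reports whether the current chunk should be closed: the byte
// target applies only once the chunk holds at least minSamples samples.
func shouldCut(numSamples, numBytes, minSamples, targetBytes int) bool {
	if numSamples < minSamples {
		// Too few samples: keep appending even past the byte target,
		// to preserve some compression for very large histograms.
		return false
	}
	return numBytes >= targetBytes
}

func main() {
	fmt.Println(shouldCut(5, 5000, 10, 1024))  // huge histograms, only 5 samples: false
	fmt.Println(shouldCut(12, 1500, 10, 1024)) // over target with enough samples: true
}
```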

```go
targetBytes := chunkenc.TargetBytesPerNativeHistogramChunk - s.nativeHistogramChunkBytesOverhead
numBytes := len(c.chunk.Bytes()) - s.nativeHistogramChunkBytesOverhead

if numBytes+s.nativeHistogramChunkBytesOverhead == 0 {
```
@beorn7 (Member):

I think this logic should still use `numSamples := c.chunk.NumSamples(); if numSamples == 0 {`. @codesome you will know this much better.

leizor added 4 commits August 8, 2023 19:16
Signed-off-by: Justin Lei <lei.justin@gmail.com>
Signed-off-by: Justin Lei <lei.justin@gmail.com>
Signed-off-by: Justin Lei <lei.justin@gmail.com>
Signed-off-by: Justin Lei <lei.justin@gmail.com>
@leizor leizor force-pushed the leizor/prometheus/issues/11219 branch from 7ab8a35 to 75af3a6 Compare August 9, 2023 02:25
leizor added 2 commits August 8, 2023 19:25
Signed-off-by: Justin Lei <lei.justin@gmail.com>
Signed-off-by: Justin Lei <lei.justin@gmail.com>
@leizor leizor force-pushed the leizor/prometheus/issues/11219 branch from 75af3a6 to bdd9005 Compare August 9, 2023 02:26
@beorn7
Copy link
Copy Markdown
Member

beorn7 commented Aug 15, 2023

I did a bit of archaeology (which was hard because of accidental squashing). The in-memory chunk snapshotting was introduced in #7229 . There is a lot of discussion in this PR but no comment on the part of the code we have run into. @bwplotka and @roidelapluie were reviewers back then. Do you maybe remember any context on the issue sketched out above?

I also checked the design doc and couldn't find any mention of this issue with cutting new chunks in the head.

@beorn7 (Member) left a comment:

It might take a while until we get an answer from the experts. For now, let's implement the same behavior as for XOR chunks. I have an idea for that, sketched out in the notes, but I'm not sure if it works. Please double-check.

I think once that behavior is implemented, we can merge this for now and discuss the appending to chunks loaded from the snapshot later. If it is really a bug, a fix should help with chunk fragmentation.

```go
	c.minTime = t
	s.nextAt = rangeForTimestamp(c.minTime, o.chunkRange)
}
```

@beorn7 (Member):

I think we can implement the corresponding (conservative) behavior as we know it from XOR chunks in the following:

If `s.histogramChunkHasComputedEndTime` is false here, we can simply set `nextChunkRangeStart` to the value of `s.nextAt` (because it was either set at chunk creation, or it was set when loading from the snapshot; either way, we want to end the chunk at that time, no matter what). And we must not set `nextChunkRangeStart` below, see note there.

@leizor (Contributor, Author):

Nice, yeah I think that works! I'll make the change.

```go
// increased or if the bucket/span count has increased.
// We also enforce a minimum sample count per chunk here.
// Note that next chunk will have its nextAt recalculated for the new rate.
nextChunkRangeStart := (o.chunkRange - (c.maxTime % o.chunkRange)) + c.maxTime
```
@beorn7 (Member):

Is that the same as `rangeForTimestamp(c.minTime, o.chunkRange)`?

See also my note above.

@leizor (Contributor, Author):

Ah, yep I think it's the same; updating to use rangeForTimestamp instead.

Signed-off-by: Justin Lei <lei.justin@gmail.com>
@leizor leizor force-pushed the leizor/prometheus/issues/11219 branch from b5028ea to 4550fa6 Compare August 18, 2023 00:50
@leizor (Contributor, Author) commented Aug 18, 2023

@beorn7 The conservative approach sounds good! I've pushed a new commit for that.

I'll also open a fresh PR with regard to that chunk snapshot loading/chunk cutting issue and attempt to tag the relevant people on it.

@beorn7 (Member) left a comment:

Thanks. Let's go with it. However, you need to solve the conflicts before I can merge.

@beorn7 beorn7 merged commit 8ef7dfd into prometheus:main Aug 24, 2023
@leizor leizor deleted the leizor/prometheus/issues/11219 branch August 24, 2023 17:55
@leizor (Contributor, Author) commented Aug 24, 2023

Thanks for shepherding this PR through @beorn7!

krajorama added a commit to grafana/mimir that referenced this pull request Sep 6, 2023
Followup to prometheus/prometheus#12054

Instead of hardcoding the chunk boundary, ignore chunks that
contain time 600.
Instead of only checking the number of result chunks, verify that the
right chunks are ignored and that ignoring actually took place.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
krajorama added a commit to grafana/mimir that referenced this pull request Sep 6, 2023
* block TestRewrite should not depend on chunk sizes

Followup to prometheus/prometheus#12054

Instead of hardcoding the chunk boundary, ignore chunks that
contain time 600.
Instead of only checking the number of result chunks, verify that the
right chunks are ignored and that ignoring actually took place.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
krajorama added a commit to grafana/mimir that referenced this pull request Sep 6, 2023
The index file size depends on the number of chunks in it,
and that recently changed due to
prometheus/prometheus#12054
This test shouldn't care about the exact number, so let's just
grab it from the OS.

Noticed during #5927

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>