
Add a chunk size limit in bytes #12054

Merged
beorn7 merged 12 commits into prometheus:main from leizor:leizor/prometheus/issues/11219
Aug 24, 2023

Conversation

@leizor (Contributor) commented Mar 3, 2023

#11219
#12009

This PR adds a mechanism to cap the size of each chunk to a soft limit depending on the type of chunk; XOR chunks are capped to 1024 bytes and (float) histogram chunks are capped to 1024 KiB.

EDIT: I'm sticking with the cap mechanism for XOR chunks, but moving to a target bytes per chunk for native histogram chunks (more details here).
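The cut decision for the soft cap can be sketched as follows. This is purely illustrative: the encoding type, constant names, and helper function are not the upstream Prometheus API, only the byte values come from the description above.

```go
package main

import "fmt"

// Encoding distinguishes the chunk types discussed above. Illustrative only.
type Encoding int

const (
	EncXOR Encoding = iota
	EncHistogram
	EncFloatHistogram
)

// Soft byte caps per encoding, using the values from the PR description:
// 1024 bytes for XOR chunks, 1024 KiB for (float) histogram chunks.
const (
	maxBytesPerXORChunk       = 1024
	maxBytesPerHistogramChunk = 1024 * 1024
)

// shouldCutNewChunk reports whether an append should start a new chunk
// because the current one has grown past its soft cap for its encoding.
func shouldCutNewChunk(enc Encoding, curBytes int) bool {
	switch enc {
	case EncXOR:
		return curBytes > maxBytesPerXORChunk
	case EncHistogram, EncFloatHistogram:
		return curBytes > maxBytesPerHistogramChunk
	}
	return false
}

func main() {
	fmt.Println(shouldCutNewChunk(EncXOR, 1025))       // just past the 1 KiB cap
	fmt.Println(shouldCutNewChunk(EncHistogram, 1025)) // well under 1 MiB
}
```

Note the cap is "soft": the check runs before an append, so a single large sample can still push a chunk past the limit.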


```diff
 for j := 0; j < numSpans; j++ {
-	s := histogram.Span{Offset: 1 + int32(i), Length: spanLength}
+	s := histogram.Span{Offset: 1, Length: spanLength}
```
@leizor (Contributor, Author):

This change prevents a new chunk from being cut on every append, and I don't think it affects the benchmark that uses this method but let me know if I'm wrong about that!

@beorn7 (Member):

Yeah, that was weird. Maybe it was a mistake from the beginning and @LeviHarrison meant to write

```go
s := histogram.Span{Offset: 1 + int32(j), Length: spanLength}
```

(`j`, not `i`)?

@LeviHarrison (Contributor):

It's been a while since I wrote that, but from what I can remember, yes: `j` makes more sense and I must have made a mistake using `i`.

@leizor leizor marked this pull request as ready for review March 3, 2023 21:05
@leizor leizor requested a review from codesome as a code owner March 3, 2023 21:05
@beorn7 (Member) left a comment:

I think the soft limit is fine (it definitely is for Prometheus, though I have heard thoughts that a hard chunk size limit would greatly help Mimir). I also believe that a hard limit is impossible for histograms anyway, as a single histogram can have an arbitrary number of buckets and could therefore exceed any given hard limit with a single sample.

As mentioned in an individual comment below: This PR implements a size limit in addition to the existing sample count limit (which is also a soft limit, interestingly). I'm not sure if that was intended. My current thought is that a size limit is superior to a sample count limit, and we should simply switch over to a size limit entirely. But it would be cool if we could base that decision on some benchmarking with real-world data.

For histograms, the research is even a bit more complex. Maybe for smaller histograms, the ideal trade-off for the chunk size is different than for larger histograms…

In any case, I think we also need the extrapolation heuristic with a size limit, see comments.

Maybe @codesome also has some thoughts. Might also help to share our mutual thoughts more directly in a chat or call…

Comment on lines +68 to +69:

```go
MaxBytesPerHistogramChunk      = 1024 * 1024
MaxBytesPerFloatHistogramChunk = 1024 * 1024
```
@beorn7 (Member):

Did you just guess these sizes or did you derive them from some real-world observations?

My gut feeling is that 1MiB for a chunk is quite big. I would broadly estimate that a typical histogram sample is about 10x larger than a float sample. Which (even if I'm guessing right) doesn't necessarily imply that 10240 bytes is a good size. I think it really depends on how things work out best in practice. Primarily decoding time vs. space savings, but there might also be steps in the performance curve if we hit certain size limits (like cache lines etc.).

Maybe even 1024 bytes is a good value for histogram chunks. It's still about 100 samples if a typical histogram takes 10 bytes… (but does it? we should look at histograms as scraped from real-world systems.)


```diff
-func generateBigTestHistograms(n int) []*histogram.Histogram {
-	const numBuckets = 500
+func generateBigTestHistograms(n, bktMultFactor int) []*histogram.Histogram {
```
@beorn7 (Member):

`bktMultFactor` is a cryptic name. Let's perhaps use the number of buckets directly, and rename the function `generateTestHistograms(numHistograms, numBuckets int)`.


Comment on lines +1349 to +1355:

```go
} else {
	maxBytesPerChunk := chunkenc.MaxBytesPerChunk(e)
	if maxBytesPerChunk > 0 && len(c.chunk.Bytes()) > maxBytesPerChunk {
		c = s.cutNewHeadChunk(t, e, chunkDiskMapper, chunkRange)
		chunkCreated = true
	}
```
@beorn7 (Member):

Maybe this test should come after the out-of-order check below? (@codesome might know better)

A more fundamental thought: This adds the check for bytes to the check for sample count. Maybe that's intended for this first iteration. But it will probably only kick in rarely then because our current chunks with about 120 samples per chunk are (I guess) much smaller than 1024 bytes in most cases. Even histogram chunks with moderately sized histograms might not reach the 1024 bytes very often (and they might almost never reach the 1MiB proposed above).

But if the byte limit kicks in, the whole heuristic below for creating evenly sized chunks within a chunk range is circumvented. The code below prevents a chunk with 120 samples from being followed by a chunk with 10 samples, by instead creating two chunks with 65 samples each. I think we need the same for the size limit, to avoid a 1024B chunk followed by a 10B chunk and instead create two chunks of 517B each.

And orthogonal to the above: I would currently think we should switch from a sample limit to a size limit altogether and not have them both in parallel.
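The even-sizing idea works the same way for samples and bytes: spread the expected total over the minimum number of chunks that respects the limit, instead of cutting at the limit and leaving a tiny remainder. A minimal sketch (the function name is illustrative, not upstream code):

```go
package main

import "fmt"

// evenChunkSize spreads an expected total (sample count or byte count)
// evenly over the fewest chunks that each stay within the limit, avoiding
// a full chunk followed by a near-empty remainder chunk.
func evenChunkSize(expectedTotal, limit int) int {
	if expectedTotal <= limit {
		return expectedTotal
	}
	numChunks := (expectedTotal + limit - 1) / limit // ceil(total / limit)
	return (expectedTotal + numChunks - 1) / numChunks
}

func main() {
	// 130 samples with a 120-sample limit: two chunks of 65, not 120 + 10.
	fmt.Println(evenChunkSize(130, 120)) // 65
	// 1034 bytes with a 1024-byte limit: two chunks of 517, not 1024 + 10.
	fmt.Println(evenChunkSize(1034, 1024)) // 517
}
```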

@leizor (Contributor, Author) commented Mar 13, 2023

> As mentioned in an individual comment below: This PR implements a size limit in addition to the existing sample count limit (which is also a soft limit, interestingly). I'm not sure if that was intended. My current thought is that a size limit is superior to a sample count limit, and we should simply switch over to a size limit entirely.

Ah, that makes sense. I initially thought of the size limit as a last-resort mechanism to cap chunk sizes, but I see your thinking. I'll rework this PR with a full switch away from the sample count limit in mind.

> But it would be cool if we could base that decision on some benchmarking with real-world data.

I think I have a decent idea of where to start on this sort of decision-making for float samples, but I suspect I'll need some help for histograms; I'll probably reach out when I get to that point. 🙂

@beorn7 (Member) commented Mar 14, 2023

I think #12055 is an easy thing to do and should totally get merged if the more sophisticated approach here will take its time. (However, as commented in #12055, I'm actually surprised that the RAM usage seems to have increased slightly. CPU or query time improvements would be expected, but why RAM? It kind of hints towards the need for a better understanding of what is going on.)

@leizor leizor force-pushed the leizor/prometheus/issues/11219 branch from 047005f to c93a5e0 Compare March 26, 2023 18:43
@leizor leizor marked this pull request as draft March 26, 2023 18:45
@leizor (Contributor, Author) commented Mar 26, 2023

I made the shift to target bytes per chunk instead of samples per chunk. I don't have a good value for target bytes for histogram chunks yet but I figured it'd be interesting to see how this change benchmarks for XOR chunks.

@beorn7 (Member) commented Mar 29, 2023

Do you want more review comments now or are you still working on this? (Asking because the PR is marked as draft.)

@leizor (Contributor, Author) commented Mar 29, 2023

Hey @beorn7! Thanks for checking in.

I had a brief chat with @codesome yesterday about this, and I think we landed on keeping the original samplesPerChunk + soft cap arrangement for XOR chunks and using this new size limit mechanism only for native histogram chunks. I was going to start exploring that a bit, but I know it doesn't align with your suggestion of switching over to a size limit entirely.

I'm a bit at a loss regarding where this PR will ultimately end up; maybe it merits further discussion?

@beorn7 (Member) commented Mar 30, 2023

I don't have a strong opinion here. All my ideas expressed so far are based on educated guessing. For the real decision, we should look at benchmarks.

If there is no pressure, we can leave everything as-is for XOR chunks. (But my impression was there is some appetite to both put more samples into a chunk and also (soft-)limit the size of chunks.)

For histograms, we need to act in any case as 120 large histograms might create very large chunks. But even there, I'm not sure that a simple size limit will be ideal. For smaller histograms, it might be better to create somewhat smaller chunks (in byte) because we don't want to pay the overhead of larger chunks if the compression has already reached diminishing returns. On the other hand, for very large histograms, it might still be better to put a certain number of those into a chunk (let's say tens) to get any compression at all, even though the chunk gets quite large then.

So yeah, it needs some research.

In any case, I'll remove this PR from my review queue until there is news on which direction the work will go.

@codesome (Member):

> But my impression was there is some appetite to both put more samples into a chunk and also (soft-)limit the size of chunks.

Yes, but I would like to do it one at a time. Adding size limit first in this PR (for both chunks), and then looking at increasing sample count limit for XOR chunks next in #12055.

Given that we try to have roughly equally sized chunks in a block (the code), we might want to revisit that logic as well, along with increasing the sample count limit in #12055.

@leizor leizor force-pushed the leizor/prometheus/issues/11219 branch 6 times, most recently from 4443804 to bd70ee2 Compare April 5, 2023 00:03
@leizor leizor marked this pull request as ready for review April 5, 2023 00:44
@leizor (Contributor, Author) commented Apr 5, 2023

@codesome @beorn7

I think this PR is ready for some 👀 again.

I went the route of sticking with samplesPerChunk + a soft cap for XOR chunks (and planning to revisit #12055 eventually as well) but moving to a target size per chunk for native histogram chunks. Native histogram chunks go through a similar end-time estimation process as XOR chunks, but I was worried that some histograms may be large enough to significantly alter the "1/4 full" assumption of computeChunkEndTime, so I added a ratioToFull parameter to that function. That function also works with floats instead of ints now, but I'm hoping that's not a big deal because it only happens once per chunk.
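The end-time extrapolation with a fill-ratio parameter can be sketched as follows. This is an illustration of the heuristic described above, under the assumption that the signature gains a ratioToFull parameter (the classic XOR assumption is 0.25, i.e. "1/4 full"); the merged code may differ in detail.

```go
package main

import (
	"fmt"
	"math"
)

// computeChunkEndTime estimates when the current head chunk should end so
// that chunks come out evenly sized within the chunk range [start, max).
// cur is the latest appended sample's timestamp, and ratioToFull is how
// full the chunk is assumed to be at that point.
func computeChunkEndTime(start, cur, max int64, ratioToFull float64) int64 {
	// Projected duration of a full chunk at the observed append rate:
	// elapsed time scaled up by 1/ratioToFull.
	fullDuration := float64(cur-start+1) / ratioToFull
	// How many chunks of that duration fit into the remaining range.
	n := math.Floor(float64(max-start) / fullDuration)
	if n <= 1 {
		// Fewer than two would fit; just run to the range boundary.
		return max
	}
	// Divide the range into n evenly sized chunks and end at the first cut.
	return start + int64(float64(max-start)/n)
}

func main() {
	// Chunk range [0, 400), 25ms elapsed at 1/4 full: four even chunks of 100.
	fmt.Println(computeChunkEndTime(0, 24, 400, 0.25)) // 100
}
```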

I've chosen 1024 as the initial value for TargetBytesPerNativeHistogramChunk. Looking at ~26k chunks of request duration data, it at least doesn't immediately look like an unreasonable number to begin with:

(Screenshot: distribution of bytes per sample across the ~26k chunks, taken Apr 4, 2023.)

  • 34% of samples are under 2 bytes (512 samples per chunk if each sample is 2 bytes)
  • 52% of samples are under 5 bytes (204 samples per chunk if each sample is 5 bytes)
  • 93% of samples are under 25 bytes (40 samples per chunk if each sample is 25 bytes)
  • There's a bump of samples in the 45-60 byte range (17 samples per chunk if each sample is 60 bytes)
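The bullet-point arithmetic above is just integer division of the target by the average sample size; a tiny sketch for checking other sample sizes against the 1024-byte target:

```go
package main

import "fmt"

// samplesPerChunk reproduces the arithmetic above: how many samples fit
// into the target chunk size at a given average bytes-per-sample.
func samplesPerChunk(targetBytes, bytesPerSample int) int {
	return targetBytes / bytesPerSample
}

func main() {
	for _, b := range []int{2, 5, 25, 60} {
		fmt.Printf("%d bytes/sample -> %d samples per 1024-byte chunk\n",
			b, samplesPerChunk(1024, b))
	}
}
```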

@leizor (Contributor, Author) commented Apr 5, 2023

As a side note: after updating a bunch of tests both ways, another reason I've chosen to stick with samples per chunk for XOR chunks, instead of also moving them to a target bytes per chunk, is that samples per chunk is overall easier to reason about than bytes per chunk, so using it where it still makes sense seemed like a win.

```go
	s.nativeHistogramChunkHasComputedEndTime = true
}
// If numBytes > targetBytes*2 then our previous prediction was invalid,
// most likely because samples rate has changed and now they are arriving more frequently.
```
@beorn7 (Member):

Or the bucket count has increased?

@leizor (Contributor, Author):

Ah, yes, good point; there's a wider range of valid reasons for invalid predictions. I'll make a note.

@beorn7 (Member) commented Apr 11, 2023

OK, so let me reiterate what I have understood:

  • This PR introduces a size limit for histogram chunks with the same extrapolation logic as previously applied for the sample count limit for XOR chunks.
  • It also introduces an additional size limit for XOR chunks, but without affecting the extrapolation logic, i.e. this is just a safety cap (and it is usually not hit, because the vast majority of XOR chunks with 120 samples are smaller than 1024 bytes).

Three high-level thoughts:

  1. Since we know the worst case sample size for XOR chunks (19 bytes), why not set the "soft" limit to 1005 bytes, resulting in a hard limit of 1024 bytes? Still, the majority of chunks won't be affected by this, but I guess it can be quite helpful for distributed storage implementations to know that a chunk is only over 1kiB at most.
  2. If we start to experiment with higher numbers of samples per XOR chunk, the above rationale about the 1024B limit being rarely hit might not hold any longer. (We don't have to do anything about it in this PR, but we have to keep it in mind for Make samplesPerChunk configurable #12055 etc.)
  3. The data about bytes per histogram is awesome. It's great to see real-world results. However, that's just one setup, and even if the vast majority of use cases see a similar distribution of sizes, I'm still concerned about the rare case where a histogram simply has a lot of buckets. In that case we would create chunks with very few samples, killing off any compression advantage just when we need it most. How about also setting a minimum number of samples per chunk? I guess something around 10 would still give us ~90% of the compression potential while preventing giant chunks.
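Point 3 can be sketched as a cut predicate that honors the byte target but never cuts before a minimum sample count is reached, so a few very large histograms can't produce near-singleton chunks with no compression to speak of. Function and parameter names are illustrative; the minimum of 10 is the guess from the comment above.

```go
package main

import "fmt"

// shouldCut reports whether the current chunk should be closed: the byte
// target applies only once the chunk holds at least minSamples samples.
func shouldCut(numSamples, numBytes, minSamples, targetBytes int) bool {
	if numSamples < minSamples {
		// Too few samples: keep appending even past the byte target,
		// to preserve some compression for very large histograms.
		return false
	}
	return numBytes >= targetBytes
}

func main() {
	fmt.Println(shouldCut(5, 5000, 10, 1024))  // huge histograms, only 5 samples: false
	fmt.Println(shouldCut(12, 1500, 10, 1024)) // over target with enough samples: true
}
```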

```go
targetBytes := chunkenc.TargetBytesPerNativeHistogramChunk - s.nativeHistogramChunkBytesOverhead
numBytes := len(c.chunk.Bytes()) - s.nativeHistogramChunkBytesOverhead

if numBytes+s.nativeHistogramChunkBytesOverhead == 0 {
```
@beorn7 (Member):

I think this logic should still use `numSamples := c.chunk.NumSamples(); if numSamples == 0 {`. @codesome you will know this much better.

leizor added 4 commits August 8, 2023 19:16
Signed-off-by: Justin Lei <lei.justin@gmail.com>
Signed-off-by: Justin Lei <lei.justin@gmail.com>
Signed-off-by: Justin Lei <lei.justin@gmail.com>
Signed-off-by: Justin Lei <lei.justin@gmail.com>
@leizor leizor force-pushed the leizor/prometheus/issues/11219 branch from 7ab8a35 to 75af3a6 Compare August 9, 2023 02:25
leizor added 2 commits August 8, 2023 19:25
Signed-off-by: Justin Lei <lei.justin@gmail.com>
Signed-off-by: Justin Lei <lei.justin@gmail.com>
@leizor leizor force-pushed the leizor/prometheus/issues/11219 branch from 75af3a6 to bdd9005 Compare August 9, 2023 02:26
@beorn7
Copy link
Copy Markdown
Member

beorn7 commented Aug 15, 2023

I did a bit of archaeology (which was hard because of accidental squashing). The in-memory chunk snapshotting was introduced in #7229 . There is a lot of discussion in this PR but no comment on the part of the code we have run into. @bwplotka and @roidelapluie were reviewers back then. Do you maybe remember any context on the issue sketched out above?

I also checked the design doc and couldn't find any mention of this issue with cutting new chunks in the head.

@beorn7 (Member) left a comment:

It might take a while until we get an answer from the experts. For now, let's implement the same behavior as for XOR chunks. I have an idea for that, sketched out in the notes, but I'm not sure if it works. Please double-check.

I think once that behavior is implemented, we can merge this for now and discuss the appending to chunks loaded from the snapshot later. If it is really a bug, a fix should help with chunk fragmentation.

```go
	c.minTime = t
	s.nextAt = rangeForTimestamp(c.minTime, o.chunkRange)
}
```

@beorn7 (Member):

I think we can implement the corresponding (conservative) behavior as we know it from XOR chunks in the following:

If `s.histogramChunkHasComputedEndTime` is false here, we can simply set `nextChunkRangeStart` to the value of `s.nextAt` (because it was either set at chunk creation, or it was set when loading from the snapshot; either way, we want to end the chunk at that time, no matter what). And we must not set `nextChunkRangeStart` below, see note there.

@leizor (Contributor, Author):

Nice, yeah I think that works! I'll make the change.

```go
// increased or if the bucket/span count has increased.
// We also enforce a minimum sample count per chunk here.
// Note that next chunk will have its nextAt recalculated for the new rate.
nextChunkRangeStart := (o.chunkRange - (c.maxTime % o.chunkRange)) + c.maxTime
```
@beorn7 (Member):

Is that the same as `rangeForTimestamp(c.minTime, o.chunkRange)`?

See also my note above.

@leizor (Contributor, Author):

Ah, yep I think it's the same; updating to use rangeForTimestamp instead.

Signed-off-by: Justin Lei <lei.justin@gmail.com>
@leizor leizor force-pushed the leizor/prometheus/issues/11219 branch from b5028ea to 4550fa6 Compare August 18, 2023 00:50
@leizor (Contributor, Author) commented Aug 18, 2023

@beorn7 The conservative approach sounds good! I've pushed a new commit for that.

I'll also open a fresh PR with regard to that chunk snapshot loading/chunk cutting issue and attempt to tag the relevant people on it.

@beorn7 (Member) left a comment:

Thanks. Let's go with it. However, you need to solve the conflicts before I can merge.

@beorn7 beorn7 merged commit 8ef7dfd into prometheus:main Aug 24, 2023
@leizor leizor deleted the leizor/prometheus/issues/11219 branch August 24, 2023 17:55
@leizor (Contributor, Author) commented Aug 24, 2023

Thanks for shepherding this PR through @beorn7!

krajorama added a commit to grafana/mimir that referenced this pull request Sep 6, 2023
Followup to prometheus/prometheus#12054

Instead of hardcoding the chunk boundary, ignore chunks that
contain time 600.
Instead of only checking the number of result chunks, verify that the
right chunks are ignored and that ignoring actually took place.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
krajorama added a commit to grafana/mimir that referenced this pull request Sep 6, 2023
* block TestRewrite should not depend on chunk sizes

Followup to prometheus/prometheus#12054

Instead of hardcoding the chunk boundary, ignore chunks that
contain time 600.
Instead of only checking the number of result chunks, verify that the
right chunks are ignored and that ignoring actually took place.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
krajorama added a commit to grafana/mimir that referenced this pull request Sep 6, 2023
The index file size depends on the number of chunks in it,
and that recently changed due to
prometheus/prometheus#12054
This test shouldn't care about the exact number, so let's just
grab it from the OS.

Noticed during #5927

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>