Add a chunk size limit in bytes#12054
Conversation
```diff
 for j := 0; j < numSpans; j++ {
-	s := histogram.Span{Offset: 1 + int32(i), Length: spanLength}
+	s := histogram.Span{Offset: 1, Length: spanLength}
```
This change prevents a new chunk from being cut on every append, and I don't think it affects the benchmark that uses this method but let me know if I'm wrong about that!
Yeah, that was weird. Maybe it was a mistake from the beginning and @LeviHarrison meant to write
s := histogram.Span{Offset: 1 + int32(j), Length: spanLength}? (j, not i)
It's been a while since I wrote that, but from what I can remember yes, it looks like j makes more sense and I must have made a mistake using i.
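To make the mix-up concrete, here's a minimal, free-standing sketch of what the two variants produce. The local `Span` type and the `spansFor` helper are simplified stand-ins for illustration, not the actual test code: with the `i`-dependent offset, every generated histogram gets a different bucket layout, which forces a chunk cut on each append; with the fixed offset, all histograms share one layout.

```go
package main

import "fmt"

// Span is a simplified stand-in for histogram.Span: Offset is the gap to the
// previous span, Length the number of consecutive buckets.
type Span struct {
	Offset int32
	Length uint32
}

// spansFor builds the spans for histogram i. With perHistogram set, the
// offset depends on i (the old behavior), so each histogram has a different
// bucket layout; with it unset, all histograms share the same layout.
func spansFor(i, numSpans int, spanLength uint32, perHistogram bool) []Span {
	spans := make([]Span, 0, numSpans)
	for j := 0; j < numSpans; j++ {
		off := int32(1)
		if perHistogram {
			off += int32(i)
		}
		spans = append(spans, Span{Offset: off, Length: spanLength})
	}
	return spans
}

func main() {
	// Old behavior: histograms 0 and 1 disagree on offsets, a layout change.
	fmt.Println(spansFor(0, 2, 3, true), spansFor(1, 2, 3, true))
	// Fixed: identical layouts across histograms, so no forced chunk cut.
	fmt.Println(spansFor(0, 2, 3, false), spansFor(1, 2, 3, false))
}
```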
beorn7
left a comment
I think the soft limit is fine (it definitely is for Prometheus – while I heard about thoughts that a hard chunk size limit would greatly help Mimir). I believe, too, that a hard limit is impossible for histograms anyway, as a single histogram can have an arbitrary number of buckets and could therefore exceed any given hard limit with a single sample.
As mentioned in an individual comment below: This PR implements a size limit in addition to the existing sample count limit (which is also a soft limit, interestingly). I'm not sure if that was intended. My current thought is that a size limit is superior to a sample count limit, and we should simply switch over to a size limit entirely. But it would be cool if we could base that decision on some benchmarking with real-world data.
For histograms, the research is even a bit more complex. Maybe for smaller histograms, the ideal trade-off for the chunk size is different than for larger histograms…
In any case, I think we also need the extrapolation heuristic with a size limit, see comments.
Maybe @codesome also has some thoughts. Might also help to share our mutual thoughts more directly in a chat or call…
tsdb/chunkenc/chunk.go
Outdated
```go
MaxBytesPerHistogramChunk      = 1024 * 1024
MaxBytesPerFloatHistogramChunk = 1024 * 1024
```
Did you just guess these sizes or did you derive them from some real-world observations?
My gut feeling is that 1MiB for a chunk is quite big. I would broadly estimate that a typical histogram sample is about 10x larger than a float sample. Which (even if I'm guessing right) doesn't necessarily imply that 10240 bytes is a good size. I think it really depends on how things work out best in practice. Primarily decoding time vs. space savings, but there might also be steps in the performance curve if we hit certain size limits (like cache lines etc.).
Maybe even 1024 bytes is a good value for histogram chunks. It's still about 100 samples if a typical histogram takes 10 bytes… (but does it? we should look at histograms as scraped from real-world systems.)
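As a quick back-of-envelope check of these numbers (the ~10 bytes per histogram sample is the guess from the comment above, not a measurement), a tiny hypothetical helper:

```go
package main

import "fmt"

// samplesPerChunk estimates how many samples fit into a chunk of the given
// size, assuming a fixed average encoded size per sample. Purely illustrative.
func samplesPerChunk(chunkBytes, bytesPerSample int) int {
	return chunkBytes / bytesPerSample
}

func main() {
	fmt.Println(samplesPerChunk(1024, 10))      // ~100 histogram samples in a 1 KiB chunk
	fmt.Println(samplesPerChunk(1024*1024, 10)) // a 1 MiB chunk would hold ~100k samples
}
```

If the per-sample guess is anywhere near right, a 1 MiB histogram chunk would hold orders of magnitude more samples than the ~120 per chunk Prometheus targets today, which supports the feeling that 1 MiB is quite big.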
tsdb/head_test.go
Outdated
```diff
-func generateBigTestHistograms(n int) []*histogram.Histogram {
-	const numBuckets = 500
+func generateBigTestHistograms(n, bktMultFactor int) []*histogram.Histogram {
```
bktMultFactor is a cryptic name. Let's perhaps use the number of buckets directly, and rename the function generateTestHistograms(numHistograms, numBuckets int).
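A hypothetical sketch of what the renamed generator with the suggested signature might look like. The types here are simplified stand-ins (the real generator returns `[]*histogram.Histogram` from the model/histogram package), so treat this purely as an illustration of the signature:

```go
package main

import "fmt"

// Simplified stand-ins for the Prometheus histogram types.
type Span struct {
	Offset int32
	Length uint32
}

type Histogram struct {
	PositiveSpans   []Span
	PositiveBuckets []int64
}

// generateTestHistograms sketches the suggested signature: the caller passes
// the number of buckets directly instead of a multiplication factor.
func generateTestHistograms(numHistograms, numBuckets int) []*Histogram {
	hs := make([]*Histogram, 0, numHistograms)
	for i := 0; i < numHistograms; i++ {
		h := &Histogram{
			// One span covering all buckets keeps the layout identical
			// across histograms, so appends don't force chunk cuts.
			PositiveSpans:   []Span{{Offset: 1, Length: uint32(numBuckets)}},
			PositiveBuckets: make([]int64, numBuckets),
		}
		for j := range h.PositiveBuckets {
			h.PositiveBuckets[j] = int64(i + j)
		}
		hs = append(hs, h)
	}
	return hs
}

func main() {
	hs := generateTestHistograms(3, 500)
	fmt.Println(len(hs), len(hs[0].PositiveBuckets))
}
```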
tsdb/head_append.go
Outdated
```go
} else {
	maxBytesPerChunk := chunkenc.MaxBytesPerChunk(e)
	if maxBytesPerChunk > 0 && len(c.chunk.Bytes()) > maxBytesPerChunk {
		c = s.cutNewHeadChunk(t, e, chunkDiskMapper, chunkRange)
		chunkCreated = true
	}
```
Maybe this test should come after the out-of-order check below? (@codesome might know better)
A more fundamental thought: This adds the check for bytes to the check for sample count. Maybe that's intended for this first iteration. But it will probably only kick in rarely then because our current chunks with about 120 samples per chunk are (I guess) much smaller than 1024 bytes in most cases. Even histogram chunks with moderately sized histograms might not reach the 1024 bytes very often (and they might almost never reach the 1MiB proposed above).
But if the byte limit kicks in, then the whole heuristic below to create evenly sized chunks within a chunk range is circumvented. The code below tries to prevent a chunk with 120 samples being followed by a chunk with 10 samples by instead creating two chunks with 65 samples each. I think we need the same for the size limit, to prevent creating a chunk of 1024B followed by one of 10B, and instead create two chunks of about 517B each.
And orthogonal to the above: I would currently think we should switch from a sample limit to a size limit altogether and not have them both in parallel.
Ah, that makes sense. I initially thought of the size limit as a last-attempt type mechanism to limit chunk sizes but I think I get your thinking. I'll rework this PR with switching over from the sample count limit entirely in mind.
I think I might have a decent idea of where to start for this sort of decision-making for float samples but I suspect I'll need some help for histograms--I'll probably reach out when I get to that point. 🙂
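For reference, the even-sizing heuristic discussed above can be sketched as follows. This free-standing `computeChunkEndTime` mirrors the shape of the helper in tsdb/head_append.go (simplified here, so treat it as illustrative): it extrapolates how many similarly filled chunks still fit into the chunk range and places the cut so they come out roughly equal. The same idea would carry over to a byte limit, with bytes instead of samples driving the "quarter full" trigger.

```go
package main

import "fmt"

// computeChunkEndTime estimates where to cut the current chunk. start is the
// chunk's first timestamp, cur the latest appended timestamp, and max the end
// of the chunk range. It extrapolates the current fill rate to see how many
// chunks fit until max and spaces the cuts evenly across the range.
func computeChunkEndTime(start, cur, max int64) int64 {
	n := (max - start) / ((cur - start + 1) * 4)
	if n == 0 {
		// On track for ~4 chunks (or fewer) in the range: keep max.
		return max
	}
	return start + (max-start)/n
}

func main() {
	// A quarter full at a quarter of the range: keep the range end (1000).
	fmt.Println(computeChunkEndTime(0, 250, 1000))
	// Filling four times faster: cut at 500, so we get two roughly equal
	// chunks instead of one full chunk followed by a tiny one.
	fmt.Println(computeChunkEndTime(0, 100, 1000))
}
```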
I think #12055 is an easy thing to do and should totally get merged if the more sophisticated approach here takes its time. (However, as commented in #12055, I'm actually surprised that the RAM usage seems to have increased slightly. CPU or query time improvements would be expected, but why RAM? It kind of hints towards the need for a better understanding of what is going on.)
Force-pushed 047005f to c93a5e0
I made the shift to target bytes per chunk instead of samples per chunk. I don't have a good value for target bytes for histogram chunks yet, but I figured it'd be interesting to see how this change benchmarks for XOR chunks.
Do you want more review comments now or are you still working on this? (Asking because the PR is marked as draft.)
Hey @beorn7! Thanks for checking in. I had a brief chat with @codesome yesterday about this, and I think what we landed on was to keep the original samplesPerChunk + soft cap arrangement for XOR chunks and to use this new size limit mechanism only for native histogram chunks. I was going to start exploring that a bit, but I know it doesn't align with your suggestion of switching over to a size limit entirely. I'm a bit at a loss regarding where this PR will ultimately end up; maybe it merits further discussion?
I don't have a strong opinion here. All my ideas expressed so far are based on educated guessing. For the real decision, we should look at benchmarks. If there is no pressure, we can leave everything as-is for XOR chunks. (But my impression was there is some appetite to both put more samples into a chunk and also (soft-)limit the size of chunks.) For histograms, we need to act in any case, as 120 large histograms might create very large chunks. But even there, I'm not sure that a simple size limit will be ideal. For smaller histograms, it might be better to create somewhat smaller chunks (in bytes) because we don't want to pay the overhead of larger chunks if the compression has already reached diminishing returns. On the other hand, for very large histograms, it might still be better to put a certain number of those into a chunk (let's say tens) to get any compression at all, even though the chunk gets quite large then. So yeah, it needs some research. In any case, I'll remove this PR from my review queue until there is news on which direction the work will go.
Yes, but I would like to do it one at a time: adding the size limit first in this PR (for both chunk types), and then looking at increasing the sample count limit for XOR chunks in #12055. Given we try to have roughly equal size chunks in a block (the code), we might want to revisit that logic as well along with increasing the sample count limit in #12055.
Force-pushed 4443804 to bd70ee2
I think this PR is ready for some 👀 again. I went the route of sticking with a samplesPerChunk + soft cap for XOR chunks (and planning on revisiting #12055 eventually as well) but moving to a target size per chunk for native histogram chunks. Native histogram chunks go through a similar end time estimation process as XOR chunks, but I was worried that some histograms may be large enough to significantly alter the "1/4 full" assumption of the existing heuristic. I've chosen 1024 as the initial value for TargetBytesPerNativeHistogramChunk.
As a side note (after updating a bunch of tests trying both ways): another reason I've chosen to stick with samples per chunk for XOR chunks instead of also moving them to a target bytes per chunk is that I think samples per chunk is overall easier to reason about than bytes per chunk, so using it where it still makes sense seemed like a win.
tsdb/head_append.go
Outdated
```go
	s.nativeHistogramChunkHasComputedEndTime = true
}
// If numBytes > targetBytes*2 then our previous prediction was invalid,
// most likely because the sample rate has changed and samples are now arriving more frequently.
```
Or the bucket count has increased?
Ah yea, good point; there's a wider spread of valid reasons for invalid predictions--I'll make a note
OK, so let me reiterate what I have understood:
Three high-level thoughts:
tsdb/head_append.go
Outdated
```go
targetBytes := chunkenc.TargetBytesPerNativeHistogramChunk - s.nativeHistogramChunkBytesOverhead
numBytes := len(c.chunk.Bytes()) - s.nativeHistogramChunkBytesOverhead

if numBytes+s.nativeHistogramChunkBytesOverhead == 0 {
```
I think this logic should still use numSamples := c.chunk.NumSamples(); if numSamples == 0 {. @codesome you will know this much better.
Signed-off-by: Justin Lei <lei.justin@gmail.com>
Force-pushed 7ab8a35 to 75af3a6
Force-pushed 75af3a6 to bdd9005
I did a bit of archaeology (which was hard because of accidental squashing). The in-memory chunk snapshotting was introduced in #7229. There is a lot of discussion in that PR but no comment on the part of the code we have run into. @bwplotka and @roidelapluie were reviewers back then. Do you maybe remember any context on the issue sketched out above? I also checked the design doc and couldn't find any mention of this issue with cutting new chunks in the head.
beorn7
left a comment
It might take a while until we get an answer from the experts. For now, let's implement the same behavior as for XOR chunks. I have an idea for that, sketched out in the notes, but I'm not sure if it works. Please double-check.
I think once that behavior is implemented, we can merge this for now and discuss the appending to chunks loaded from the snapshot later. If it is really a bug, a fix should help with chunk fragmentation.
```go
	c.minTime = t
	s.nextAt = rangeForTimestamp(c.minTime, o.chunkRange)
}
```
I think we can implement the corresponding (conservative) behavior as we know it from XOR chunks in the following:
If s.histogramChunkHasComputedEndTime is false here, we can simply set nextChunkRangeStart to the value of s.nextAt (because it was either set at chunk creation, or it was set when loading from the snapshot - either way, we want to end the chunk at that time, no matter what). And we must not set nextChunkRangeStart below, see note there.
Nice, yeah I think that works! I'll make the change.
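The conservative rule agreed on here could look roughly like this free-standing sketch. `nextCut` is a hypothetical helper for illustration only; the real code works on memSeries fields such as s.nextAt and s.histogramChunkHasComputedEndTime:

```go
package main

import "fmt"

// nextCut decides where the current histogram chunk should end. If no end
// time has been computed for the chunk yet, honor the existing nextAt (set
// at chunk creation or when loading from a snapshot); only otherwise fall
// back to aligning the cut with the end of the current chunk range.
func nextCut(hasComputedEndTime bool, nextAt, maxTime, chunkRange int64) int64 {
	if !hasComputedEndTime {
		return nextAt
	}
	// rangeForTimestamp-style alignment: end of the range containing maxTime.
	return maxTime - maxTime%chunkRange + chunkRange
}

func main() {
	fmt.Println(nextCut(false, 1234, 1700, 1000)) // keeps the preset nextAt: 1234
	fmt.Println(nextCut(true, 1234, 1700, 1000))  // aligns to the range boundary: 2000
}
```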
tsdb/head_append.go
Outdated
```go
// increased or if the bucket/span count has increased.
// We also enforce a minimum sample count per chunk here.
// Note that next chunk will have its nextAt recalculated for the new rate.
nextChunkRangeStart := (o.chunkRange - (c.maxTime % o.chunkRange)) + c.maxTime
```
Is that the same as rangeForTimestamp(c.minTime, o.chunkRange)?
See also my note above.
Ah, yep I think it's the same; updating to use rangeForTimestamp instead.
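A quick sanity check of that equivalence. `rangeForTimestamp` below follows the definition in tsdb/head.go (the end of the chunk range containing t, assuming t >= 0 and width > 0), and the loop compares it against the inlined expression from the diff for the same timestamp:

```go
package main

import "fmt"

// rangeForTimestamp returns the end of the chunk range that contains t,
// matching the helper in tsdb/head.go (assumes t >= 0 and width > 0).
func rangeForTimestamp(t, width int64) int64 {
	return (t/width)*width + width
}

func main() {
	chunkRange := int64(3600000) // 1h in milliseconds
	for _, maxTime := range []int64{0, 1, 1799999, 3600000, 7200001} {
		inlined := (chunkRange - (maxTime % chunkRange)) + maxTime
		fmt.Println(inlined == rangeForTimestamp(maxTime, chunkRange))
	}
}
```

(The question above uses c.minTime rather than c.maxTime; the two calls agree whenever minTime and maxTime fall within the same chunk range, which should hold as chunks are cut at range boundaries.)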
Force-pushed b5028ea to 4550fa6
@beorn7 The conservative approach sounds good! I've pushed a new commit for that. I'll also open a fresh PR with regard to that chunk snapshot loading/chunk cutting issue and attempt to tag the relevant people on it.
beorn7
left a comment
Thanks. Let's go with it. However, you need to solve the conflicts before I can merge.
Thanks for shepherding this PR through @beorn7!
* block TestRewrite should not depend on chunk sizes Followup to prometheus/prometheus#12054 Instead of hardcoding the chunk boundary, ignore chunks that contain time 600. Instead of only checking the number of result chunks, verify that the right chunks are ignored and that there was ignore going on. Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
The index file sizes depends on the number of chunks in it and that recently changed due to prometheus/prometheus#12054 This test shouldn't care about the exact number so let's just grab from the OS. Noticed during #5927 Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

#11219
#12009
This PR adds a mechanism to cap the size of each chunk to a soft limit depending on the type of chunk; XOR chunks are capped to 1024 bytes and (float) histogram chunks are capped to 1 MiB.

EDIT: I'm sticking with the cap mechanism for XOR chunks, but moving to a target bytes per chunk for native histogram chunks (more details here).