Remove Individual Data Type Caps in Per-shard Buffering for Remote Write#8921
Conversation
a625dd1 to 273262f (compare)
Thanks for this. @cstyan would you have some time to review it? @LeviHarrison did you run any benchmarks on this?

Not yet, I probably should.

I should be able to review next week; the end of this week is a holiday here. Benchmarks with memory stats would be good :)

Speaking of benchmarking... prometheus/test-infra#249
cstyan left a comment
Thanks for taking this on @LeviHarrison, looking good so far. The benchmark screenshots look good if we can see those results consistently; the same allocations and overall memory usage but better pending samples throughput.
It would be nice if we could benchmark this in a more automated fashion, especially for any more potential future work. Maybe BenchmarkSampleDelivery could be extended to also send exemplars?
Dave Cheney's post here has some good content on benchmarks: https://dave.cheney.net/high-performance-go-workshop/gophercon-2019.html#profiling_benchmarks
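One shape the suggested benchmark extension could take, as a self-contained sketch: the `point` type and `enqueue` function here are hypothetical stand-ins for the real queue-manager types, shown only to illustrate interleaving exemplars with samples and reporting allocations, not the actual `BenchmarkSampleDelivery` code.

```go
package main

import (
	"fmt"
	"testing"
)

// point is a hypothetical stand-in for a queued data point that is either
// a sample or an exemplar.
type point struct {
	isExemplar bool
	value      float64
}

// enqueue is a stand-in for pushing a batch through a shard; it just counts
// how many of each kind it saw.
func enqueue(batch []point) (samples, exemplars int) {
	for _, p := range batch {
		if p.isExemplar {
			exemplars++
		} else {
			samples++
		}
	}
	return samples, exemplars
}

// benchmarkDelivery builds a batch where every exemplarEvery-th point is an
// exemplar, then measures enqueueing it.
func benchmarkDelivery(b *testing.B, exemplarEvery int) {
	batch := make([]point, 1000)
	for i := range batch {
		batch[i] = point{isExemplar: exemplarEvery > 0 && i%exemplarEvery == 0, value: float64(i)}
	}
	b.ReportAllocs() // include memory stats, as requested in review
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		enqueue(batch)
	}
}

func main() {
	// testing.Benchmark lets the sketch run outside `go test` for illustration.
	res := testing.Benchmark(func(b *testing.B) { benchmarkDelivery(b, 10) })
	fmt.Println(res.N > 0)
}
```

In a real extension the exemplar ratio would be a table-driven parameter so `benchstat` can compare sample-only and mixed workloads.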
@cstyan I'll take a look at expanding the benchmarks. I think that will probably end up as a separate PR, though.
csmarchbanks left a comment
Generally I think this is close, but I've left one performance concern, and I think a solution for it.
storage/remote/queue_manager.go
Outdated
sampleBuffer[nPendingSamples][0] = d.sample
pendingData[nPending].Labels = labelsToLabelsProto(d.seriesLabels, pendingData[nPending].Labels)
pendingData[nPending].Samples = sampleBuffer[nPendingSamples]
pendingData[nPending].Samples = []prompb.Sample{d.sample}
By creating a new slice of samples I fear that we are reducing some of the allocation optimizations made in #5849. I wonder if we could still pre-allocate samples and exemplars with len=0 and cap=1, then either set the value or truncate the slice as we use them?
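The pre-allocate-then-truncate idea can be sketched in isolation; `newBuffers` and the `sample` type below are illustrative stand-ins, not the actual `prompb` code.

```go
package main

import "fmt"

// sample is an illustrative stand-in for prompb.Sample.
type sample struct{ Value float64 }

// newBuffers pre-allocates n slots, each a len=0/cap=1 slice, so the common
// one-sample append in the hot path never allocates.
func newBuffers(n int) [][]sample {
	bufs := make([][]sample, n)
	for i := range bufs {
		bufs[i] = make([]sample, 0, 1)
	}
	return bufs
}

func main() {
	bufs := newBuffers(2)
	before := cap(bufs[0])
	bufs[0] = append(bufs[0], sample{1}) // fits in the pre-allocated capacity
	bufs[0] = bufs[0][:0]                // truncate for reuse; backing array is kept
	fmt.Println(before, cap(bufs[0]), len(bufs[0]))
}
```

Truncating with `[:0]` keeps the backing array alive across cycles, so reuse produces no garbage for the collector.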
Here is a benchmark you can use to see these concerns:
name old time/op new time/op delta
SampleDelivery-16 15.5s ± 2% 16.9s ± 1% +8.72% (p=0.001 n=7+6)
name old alloc/op new alloc/op delta
SampleDelivery-16 20.5GB ± 0% 21.7GB ± 0% +5.85% (p=0.001 n=7+7)
name old allocs/op new allocs/op delta
SampleDelivery-16 176M ± 0% 201M ± 0% +14.20% (p=0.001 n=7+7)
The screenshots you posted don't show much of a difference, so it is possible that this micro benchmark is no longer valid, but it would be good to see for sure as at one point it was useful for me.
I think what you're saying is that when we allocate pendingData we should also allocate len=0 sample and exemplar slices to pre-populate the fields. The concern there was that we'd over-allocate exemplars, but I think that if an allocation now happens for every single new sample, the initial penalty is well worth it.
Yep, if someone has send-exemplars turned on, chances are they have some exemplars, so eventually all of those would be allocated anyway, and it is much better than creating garbage that needs to be cleaned up for every sample.
Actually, after trying to implement this I don't think it will work. Before, when there were two separate slices that each held only one type (sample or exemplar), they would get flushed only up to the index they had been filled to, eliminating the possibility of a data point from the last cycle being sent again. But because each slice can now hold either a sample or an exemplar, the entry from the last cycle won't get overwritten if it isn't of the same type as the new data point. The only way no samples get re-flushed is if the slices are filled in exactly the same order every single time, which isn't guaranteed.
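The stale-data hazard described here can be shown with a minimal sketch; the types and `resetSlot` helper are simplified stand-ins, not the real `prompb.TimeSeries` code.

```go
package main

import "fmt"

// Simplified stand-ins for the prompb types (illustrative only).
type sample struct{ Value float64 }
type exemplar struct{ Value float64 }
type timeSeries struct {
	Samples   []sample
	Exemplars []exemplar
}

// resetSlot truncates both slices so nothing written in a previous cycle can
// leak into the next flush; the backing arrays are kept, so no allocation.
func resetSlot(ts *timeSeries) {
	ts.Samples = ts.Samples[:0]
	ts.Exemplars = ts.Exemplars[:0]
}

func main() {
	var slot timeSeries

	// Cycle 1: the slot holds a sample.
	slot.Samples = append(slot.Samples, sample{1})

	// Cycle 2 without a reset: an exemplar is written, but the stale sample
	// from cycle 1 is still present and would be flushed again.
	slot.Exemplars = append(slot.Exemplars, exemplar{2})
	fmt.Println(len(slot.Samples), len(slot.Exemplars)) // 1 1

	// Cycle 2 with a reset: only the new exemplar remains.
	resetSlot(&slot)
	slot.Exemplars = append(slot.Exemplars, exemplar{2})
	fmt.Println(len(slot.Samples), len(slot.Exemplars)) // 0 1
}
```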
Hmm, yeah this may add a fair bit more complexity. My thought is something like:
pendingData[nPending].Samples = pendingData[nPending].Samples[:0]
pendingData[nPending].Exemplars = pendingData[nPending].Exemplars[:0]
switch d := sample.(type) {
case writeSample:
	pendingData[nPending].Samples = append(pendingData[nPending].Samples, d.sample)
case writeExemplar:
	pendingData[nPending].Exemplars = append(pendingData[nPending].Exemplars, d.exemplar)
}
Do you think that would work? You might be able to use some ifs to avoid the truncations of the slices, but I don't know if it would help with performance at all.
I'm not sure; let's give it a try. I've just pushed the change and tests pass. I'm not sure which benchmark you were using; if I knew, I would run it myself.
I have been running:
go test -bench=BenchmarkSampleDelivery -benchmem -run=djfkasldf ./storage/remote
for the benchmark. Here are the results for before this change, without preallocation, and with preallocation:
benchstat before.txt no_preallocation.txt preallocation.txt
name \ time/op before.txt no_preallocation.txt preallocation.txt
SampleDelivery-16 15.5s ± 2% 16.9s ± 1% 15.1s ± 1%
name \ alloc/op before.txt no_preallocation.txt preallocation.txt
SampleDelivery-16 20.5GB ± 0% 21.7GB ± 0% 20.5GB ± 0%
name \ allocs/op before.txt no_preallocation.txt preallocation.txt
SampleDelivery-16 176M ± 0% 201M ± 0% 176M ± 0%
So it looks like we are back to the same number of allocations now (yay!) and might even be a couple percent faster compared to before this PR.
storage/remote/queue_manager.go
Outdated
var pendingData = make([]prompb.TimeSeries, max)
for i := range pendingData {
	pendingData[i].Samples = []prompb.Sample{{}}
	pendingData[i].Exemplars = []prompb.Exemplar{{}}
I just realized: I guess we actually only need to allocate this if s.qm.sendExemplars? We would probably also have to protect line 1074 with an if as well at that point. Seems like it wouldn't make much of a difference in practice so 🤷.
We could also go the approach of:

var reset func()
if !s.qm.sendExemplars {
	reset = func() {
		pendingData[nPending].Samples = pendingData[nPending].Samples[:0]
	}
} else {
	reset = func() {
		pendingData[nPending].Samples = pendingData[nPending].Samples[:0]
		pendingData[nPending].Exemplars = pendingData[nPending].Exemplars[:0]
	}
}

might be cleaner, maybe even better for performance, IDK.
Signed-off-by: Levi Harrison <git@leviharrison.dev>
909ba1c to 38edd67 (compare)
csmarchbanks left a comment
Approving the small new changes as well. @cstyan, any other concerns on this one? Otherwise let's merge it.
👍 lgtm

Thanks for all your time reviewing!
* Fix `kuma_sd` targetgroup reporting (#9157): bundle all xDS targets into a single group. Signed-off-by: austin ce <austin.cawley@gmail.com>
* Snapshot in-memory chunks on shutdown for faster restarts (#7229). Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Rename links. Signed-off-by: Levi Harrison <git@leviharrison.dev>
* Remove Individual Data Type Caps in Per-shard Buffering for Remote Write (#8921): moved everything to the nPending buffer; simplified exemplar capacity addition; added pre-allocation; don't allocate if not sending exemplars. Signed-off-by: Levi Harrison <git@leviharrison.dev>
* Avoid deadlock when processing duplicate series record (#9170): `processWALSamples()` needs to be able to send on its output channel before it can read the input channel, so it reads to allow this in case the output channel is full. Also updated a comment in `processWALSamples` whose previous text seems to relate to an earlier implementation. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Optimise WAL loading by removing extra map and caching min-time (#9160):
  * BenchmarkLoadWAL: close WAL after use, so that goroutines are stopped and resources released.
  * BenchmarkLoadWAL: make series IDs co-prime with #workers. Series are distributed across workers by taking the modulus of the ID with the number of workers, so multiples of 100 are a poor choice.
  * BenchmarkLoadWAL: simulate mmapped chunks. Real Prometheus cuts chunks every 120 samples, then skips those samples when re-reading the WAL; simulate this by creating a single mapped chunk for each series, since the max time is all the reader looks at.
  * Remove series map from `processWALSamples()`. The locks it was commented to reduce contention on are now sharded 32,000 ways, so they won't be contended; removing the map saves memory and goes just as fast.
  * loadWAL: cache the last mmapped chunk time, so we can skip calling `append()` for samples it will reject.
  * Cache max time in both places mmappedChunks is updated, including a refactor to extract function `setMMappedChunks` to reduce code duplication.
  * Update head min/max time when mmapped chunks are added, ensuring we have the correct values if no WAL samples are added for that series. Note that `mSeries.maxTime()` was always `math.MinInt64` before, since that function doesn't consider mmapped chunks.
  Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Split Go and React Tests (#8897): added go-ci and react-ci. Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu> Signed-off-by: Levi Harrison <git@leviharrison.dev>
* Remove search keymap from new expression editor (#9184). Signed-off-by: Julius Volz <julius.volz@gmail.com>

Co-authored-by: Austin Cawley-Edwards <austin.cawley@gmail.com>
Co-authored-by: Levi Harrison <git@leviharrison.dev>
Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Co-authored-by: Bryan Boreham <bjboreham@gmail.com>
Co-authored-by: Julius Volz <julius.volz@gmail.com>


This PR consolidates the `sampleBuffer` and `exemplarBuffer` buffers into the `pendingData` buffer, removing the individual cap for each respective data type in favor of a single cap for all data points combined.

Fixes #8788
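Putting the pieces from the review discussion together, here is a rough, self-contained sketch of the consolidated approach: one reused `pendingData` buffer, truncated each cycle, with cap-1 pre-allocation. All types and the `buildPending` helper are illustrative stand-ins, not the actual queue_manager code.

```go
package main

import "fmt"

// Illustrative stand-ins for prompb.Sample, prompb.Exemplar, and
// prompb.TimeSeries.
type prompbSample struct{ Value float64 }
type prompbExemplar struct{ Value float64 }
type timeSeries struct {
	Samples   []prompbSample
	Exemplars []prompbExemplar
}

// Hypothetical queue entry types, one per data point kind.
type writeSample struct{ sample prompbSample }
type writeExemplar struct{ exemplar prompbExemplar }

// buildPending fills one pendingData slot per incoming item. Each slot's
// slices are truncated first, so nothing from a previous flush survives,
// and the appends reuse the pre-allocated backing arrays.
func buildPending(pendingData []timeSeries, batch []interface{}, sendExemplars bool) int {
	nPending := 0
	for _, item := range batch {
		pendingData[nPending].Samples = pendingData[nPending].Samples[:0]
		if sendExemplars {
			pendingData[nPending].Exemplars = pendingData[nPending].Exemplars[:0]
		}
		switch d := item.(type) {
		case writeSample:
			pendingData[nPending].Samples = append(pendingData[nPending].Samples, d.sample)
		case writeExemplar:
			pendingData[nPending].Exemplars = append(pendingData[nPending].Exemplars, d.exemplar)
		}
		nPending++
	}
	return nPending
}

func main() {
	// Pre-allocate len=0/cap=1 slices so one-element appends never allocate.
	pendingData := make([]timeSeries, 4)
	for i := range pendingData {
		pendingData[i].Samples = make([]prompbSample, 0, 1)
		pendingData[i].Exemplars = make([]prompbExemplar, 0, 1)
	}
	batch := []interface{}{
		writeSample{prompbSample{1}},
		writeExemplar{prompbExemplar{2}},
		writeSample{prompbSample{3}},
	}
	n := buildPending(pendingData, batch, true)
	fmt.Println(n, len(pendingData[0].Samples), len(pendingData[1].Exemplars))
}
```

One `nPending` counter caps samples and exemplars together, which is exactly the "single cap for all data points combined" described above.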