
Remove Individual Data Type Caps in Per-shard Buffering for Remote Write#8921

Merged
csmarchbanks merged 4 commits into prometheus:main from LeviHarrison:exemplar-shard-buffer
Aug 9, 2021

Conversation

@LeviHarrison (Contributor) commented Jun 11, 2021

This PR consolidates the sampleBuffer and exemplarBuffer buffers into the single pendingData buffer, removing the individual cap for each data type in favor of one cap for all data points combined.

Fixes #8788

@roidelapluie (Member)

Thanks for this. @cstyan would you have some time to review it?

@LeviHarrison did you run any benchmarks on this?

@LeviHarrison (Contributor, Author)

Not yet, I probably should.

@cstyan (Member) commented Jun 30, 2021

I should be able to review next week, end of this week is a holiday here. Benchmarks with memory stats would be good :)

@LeviHarrison (Contributor, Author)

I set up a benchmark using Grafana's TNS demo, so plenty of exemplars. I took this screenshot after it had been running for about 10 minutes.

[Screenshot: test-dashboard]

@LeviHarrison (Contributor, Author)

After about 20 minutes:

[Screenshot: benchmark-2]

@LeviHarrison (Contributor, Author)

Speaking of benchmarking... prometheus/test-infra#249

@cstyan (Member) left a comment

Thanks for taking this on @LeviHarrison, looking good so far. The benchmark screenshots look good if we can see those results consistently: the same allocations and overall memory usage, but better pending-samples throughput.

It would be nice if we could benchmark this in a more automated fashion, especially for any more potential future work. Maybe BenchmarkSampleDelivery could be extended to also send exemplars?

Dave Cheney's post here has some good content on benchmarks: https://dave.cheney.net/high-performance-go-workshop/gophercon-2019.html#profiling_benchmarks
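
As a rough illustration of what an extended benchmark could look like, here is a minimal, self-contained sketch. The type and function names (`timeSeries`, `buildPending`, `benchmarkDelivery`) are hypothetical stand-ins, not the real `BenchmarkSampleDelivery` or `prompb` types; it only shows the shape of a benchmark that exercises both samples and exemplars in a shared per-shard buffer.

```go
package main

import (
	"fmt"
	"testing"
)

// Hypothetical stand-ins for prompb.Sample / prompb.Exemplar.
type sample struct{ ts int64; v float64 }
type exemplar struct{ ts int64; v float64 }

type timeSeries struct {
	samples   []sample
	exemplars []exemplar
}

// buildPending mimics one fill pass over the shared pending buffer:
// every other data point is an exemplar, so both fields get exercised.
func buildPending(pending []timeSeries) {
	for i := range pending {
		pending[i].samples = pending[i].samples[:0]
		pending[i].exemplars = pending[i].exemplars[:0]
		if i%2 == 0 {
			pending[i].samples = append(pending[i].samples, sample{int64(i), 1})
		} else {
			pending[i].exemplars = append(pending[i].exemplars, exemplar{int64(i), 1})
		}
	}
}

func benchmarkDelivery(b *testing.B) {
	// Pre-allocate len=0, cap=1 slices once, outside the timed loop.
	pending := make([]timeSeries, 1000)
	for i := range pending {
		pending[i].samples = make([]sample, 0, 1)
		pending[i].exemplars = make([]exemplar, 0, 1)
	}
	b.ReportAllocs()
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		buildPending(pending)
	}
}

func main() {
	// testing.Benchmark lets us run the benchmark outside `go test`.
	res := testing.Benchmark(benchmarkDelivery)
	fmt.Printf("allocs/op: %d\n", res.AllocsPerOp())
}
```

With the buffers pre-allocated, the fill pass itself should report zero allocations per op, which is the property the review discussion below circles around.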

@LeviHarrison (Contributor, Author)

// TODO(cstyan): Casting this many times smells, also we could get index out of bounds issues here.

I removed this comment from where maxExemplars was assigned; is it still valid?

@LeviHarrison (Contributor, Author)

@cstyan I'll take a look at expanding the benchmarks. I think that will probably end up as a separate PR though.

@csmarchbanks (Member) left a comment

Generally I think this is close, but I left one performance concern and, I think, a solution for it.

sampleBuffer[nPendingSamples][0] = d.sample
pendingData[nPending].Labels = labelsToLabelsProto(d.seriesLabels, pendingData[nPending].Labels)
pendingData[nPending].Samples = sampleBuffer[nPendingSamples]
pendingData[nPending].Samples = []prompb.Sample{d.sample}
@csmarchbanks (Member) commented Jul 20, 2021

By creating a new slice of samples I fear that we are reducing some of the allocation optimizations made in #5849. I wonder if we could still pre-allocate samples and exemplars with len=0 and cap=1, then either set the value or truncate the slice as we use them?

Here is a benchmark you can use to see these concerns:

name               old time/op    new time/op    delta
SampleDelivery-16     15.5s ± 2%     16.9s ± 1%   +8.72%  (p=0.001 n=7+6)

name               old alloc/op   new alloc/op   delta
SampleDelivery-16    20.5GB ± 0%    21.7GB ± 0%   +5.85%  (p=0.001 n=7+7)

name               old allocs/op  new allocs/op  delta
SampleDelivery-16      176M ± 0%      201M ± 0%  +14.20%  (p=0.001 n=7+7)

The screenshots you posted don't show much of a difference, so it is possible that this micro benchmark is no longer valid, but it would be good to see for sure as at one point it was useful for me.

@LeviHarrison (Contributor, Author)

I think what you're saying is that when we allocate pendingData we should also allocate len=0 sample and exemplar slices to pre-populate the fields. The concern there was that we'd over-allocate exemplars, but I think that if an allocation now happens for every single new sample, the initial penalty is well worth it.

@csmarchbanks (Member)

Yep, if someone has send_exemplars turned on, chances are they have some exemplars, so eventually all of those would be allocated anyway, and it is much better than creating garbage that needs to be cleaned up for every sample.

@LeviHarrison (Contributor, Author) commented Jul 24, 2021

Actually, after trying to implement this, I don't think it will work. Before, when there were two separate slices that each held only one type (sample or exemplar), they would get flushed up to the index they had been filled to, eliminating the possibility of a data point from the last cycle being sent again. But now that each entry can hold either a sample or an exemplar, a data point from the last cycle won't get wiped out if it isn't the same type as the new one. The only way no data points get re-flushed is if the buffer is filled in the exact same order every single time, which isn't guaranteed.
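
The hazard described above can be shown concretely. This is a minimal sketch with hypothetical types (plain float64 slices standing in for the real sample/exemplar protos): if a reused entry is not truncated each cycle, a stale sample from the previous cycle survives when the new data point is of the other type.

```go
package main

import "fmt"

type timeSeries struct {
	samples   []float64
	exemplars []float64
}

// fillNoTruncate overwrites only the field matching the new data point,
// leaving the other field exactly as it was last cycle.
func fillNoTruncate(ts *timeSeries, isExemplar bool, v float64) {
	if isExemplar {
		ts.exemplars = []float64{v}
	} else {
		ts.samples = []float64{v}
	}
}

// fillTruncate resets both fields first, so only this cycle's data survives.
func fillTruncate(ts *timeSeries, isExemplar bool, v float64) {
	ts.samples = ts.samples[:0]
	ts.exemplars = ts.exemplars[:0]
	if isExemplar {
		ts.exemplars = append(ts.exemplars, v)
	} else {
		ts.samples = append(ts.samples, v)
	}
}

func main() {
	var ts timeSeries
	fillNoTruncate(&ts, false, 1) // cycle 1: a sample
	fillNoTruncate(&ts, true, 2)  // cycle 2: only an exemplar
	fmt.Println(len(ts.samples))  // stale sample still present: 1

	var ts2 timeSeries
	fillTruncate(&ts2, false, 1)
	fillTruncate(&ts2, true, 2)
	fmt.Println(len(ts2.samples)) // 0: stale sample wiped
}
```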

@csmarchbanks (Member)

Hmm, yeah this may add a fair bit more complexity. My thought is something like:

pendingData[nPending].Samples = pendingData[nPending].Samples[:0]
pendingData[nPending].Exemplars = pendingData[nPending].Exemplars[:0]
switch d := sample.(type) {
case writeSample:
    pendingData[nPending].Samples = append(pendingData[nPending].Samples, d.sample)
case writeExemplar:
    pendingData[nPending].Exemplars = append(pendingData[nPending].Exemplars, d.exemplar)
}

Do you think that would work? You might be able to use some ifs to avoid the truncations of the slices, but I don't know if it would help with performance at all.
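
For reference, here is a self-contained, runnable version of that truncate-then-append pattern. All the types here (`promSample`, `promExemplar`, `writeSample`, `writeExemplar`, `timeSeries`) are mock stand-ins for the real `prompb` and queue-manager types, just to show the mechanics.

```go
package main

import "fmt"

// Mock stand-ins for prompb.Sample / prompb.Exemplar and the queue's
// interleaved data-point wrappers (hypothetical names).
type promSample struct{ v float64 }
type promExemplar struct{ v float64 }
type writeSample struct{ sample promSample }
type writeExemplar struct{ exemplar promExemplar }

type timeSeries struct {
	Samples   []promSample
	Exemplars []promExemplar
}

// populate applies the truncate-then-append pattern: both slices are reset
// to length zero (keeping capacity), then only the matching one is filled,
// so nothing from a previous cycle can leak through.
func populate(pendingData []timeSeries, nPending int, point interface{}) {
	pendingData[nPending].Samples = pendingData[nPending].Samples[:0]
	pendingData[nPending].Exemplars = pendingData[nPending].Exemplars[:0]
	switch d := point.(type) {
	case writeSample:
		pendingData[nPending].Samples = append(pendingData[nPending].Samples, d.sample)
	case writeExemplar:
		pendingData[nPending].Exemplars = append(pendingData[nPending].Exemplars, d.exemplar)
	}
}

func main() {
	pending := make([]timeSeries, 2)
	for i := range pending {
		// Pre-allocate with len=0, cap=1 so append usually won't allocate.
		pending[i].Samples = make([]promSample, 0, 1)
		pending[i].Exemplars = make([]promExemplar, 0, 1)
	}
	populate(pending, 0, writeSample{promSample{1.5}})
	populate(pending, 1, writeExemplar{promExemplar{2.5}})
	fmt.Println(len(pending[0].Samples), len(pending[1].Exemplars)) // 1 1
}
```

The truncation `s[:0]` keeps the backing array, so repeated cycles reuse the same memory rather than generating garbage per data point.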

@LeviHarrison (Contributor, Author) commented Jul 26, 2021

I'm not sure, let's give it a try. Just pushed the change; tests pass. I'm not sure which benchmark you were using, otherwise I would run it myself.

@csmarchbanks (Member)

I have been running:

go test -bench=BenchmarkSampleDelivery -benchmem -run=djfkasldf ./storage/remote

for the benchmark. Here are the results for before this change, without preallocation, and with preallocation:

benchstat before.txt no_preallocation.txt preallocation.txt
name \ time/op     before.txt   no_preallocation.txt  preallocation.txt
SampleDelivery-16   15.5s ± 2%            16.9s ± 1%         15.1s ± 1%

name \ alloc/op    before.txt   no_preallocation.txt  preallocation.txt
SampleDelivery-16  20.5GB ± 0%           21.7GB ± 0%        20.5GB ± 0%

name \ allocs/op   before.txt   no_preallocation.txt  preallocation.txt
SampleDelivery-16    176M ± 0%             201M ± 0%          176M ± 0%

So it looks like we are back to the same number of allocations now (yay!) and might even be a couple percent faster compared to before this PR.

@csmarchbanks (Member) left a comment

👍

var pendingData = make([]prompb.TimeSeries, max)
for i := range pendingData {
    pendingData[i].Samples = []prompb.Sample{{}}
    pendingData[i].Exemplars = []prompb.Exemplar{{}}
}
@csmarchbanks (Member) commented Jul 27, 2021

I just realized: I guess we actually only need to allocate this if s.qm.sendExemplars? We would probably also have to protect line 1074 with an if as well at that point. Seems like it wouldn't make much of a difference in practice so 🤷.
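
A sketch of what gating that allocation might look like. The constructor and field names here are hypothetical (the real code inlines this in the shard loop and uses `prompb` types); it just shows the exemplar slices staying nil when exemplar sending is off.

```go
package main

import "fmt"

// Mock stand-ins for the prompb types (hypothetical names).
type promSample struct{ v float64 }
type promExemplar struct{ v float64 }

type timeSeries struct {
	Samples   []promSample
	Exemplars []promExemplar
}

// newPendingData pre-allocates sample slices always, but exemplar slices
// only when sendExemplars is set, as suggested in the comment above.
func newPendingData(max int, sendExemplars bool) []timeSeries {
	pendingData := make([]timeSeries, max)
	for i := range pendingData {
		pendingData[i].Samples = make([]promSample, 0, 1)
		if sendExemplars {
			pendingData[i].Exemplars = make([]promExemplar, 0, 1)
		}
	}
	return pendingData
}

func main() {
	pd := newPendingData(3, false)
	// Samples are ready to append into; exemplar slices were never allocated.
	fmt.Println(cap(pd[0].Samples), pd[0].Exemplars == nil) // 1 true
}
```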

@LeviHarrison (Contributor, Author)

True

@LeviHarrison (Contributor, Author)

We could also go with the approach of:

var reset func()
if !s.qm.sendExemplars {
  reset = func() {
    pendingData[nPending].Samples = pendingData[nPending].Samples[:0]
  }
} else {
  reset = func() {
    pendingData[nPending].Samples = pendingData[nPending].Samples[:0]
    pendingData[nPending].Exemplars = pendingData[nPending].Exemplars[:0]
  }
}

It might be cleaner, maybe even better for performance, IDK.

Signed-off-by: Levi Harrison <git@leviharrison.dev>
@LeviHarrison force-pushed the exemplar-shard-buffer branch from 909ba1c to 38edd67 on July 27, 2021 16:42
@csmarchbanks (Member) left a comment

Approving the small new changes as well. @cstyan, any other concerns on this one? Otherwise let's merge it.

@cstyan (Member) commented Aug 9, 2021

👍 lgtm

@csmarchbanks csmarchbanks merged commit fac1b57 into prometheus:main Aug 9, 2021
@LeviHarrison (Contributor, Author)

Thanks for all your time reviewing!

codesome added a commit that referenced this pull request Aug 11, 2021
* Fix `kuma_sd` targetgroup reporting (#9157)

* Bundle all xDS targets into a single group

Signed-off-by: austin ce <austin.cawley@gmail.com>

* Snapshot in-memory chunks on shutdown for faster restarts (#7229)

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Rename links

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Remove Individual Data Type Caps in Per-shard Buffering for Remote Write (#8921)

* Moved everything to nPending buffer

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Simplify exemplar capacity addition

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Added pre-allocation

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Don't allocate if not sending exemplars

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Avoid deadlock when processing duplicate series record (#9170)

* Avoid deadlock when processing duplicate series record

`processWALSamples()` needs to be able to send on its output channel
before it can read the input channel, so the output channel is also read
to allow this, in case it is full.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* processWALSamples: update comment

Previous text seems to relate to an earlier implementation.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Optimise WAL loading by removing extra map and caching min-time (#9160)

* BenchmarkLoadWAL: close WAL after use

So that goroutines are stopped and resources released

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* BenchmarkLoadWAL: make series IDs co-prime with #workers

Series are distributed across workers by taking the modulus of the
ID with the number of workers, so multiples of 100 are a poor choice.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* BenchmarkLoadWAL: simulate mmapped chunks

Real Prometheus cuts chunks every 120 samples, then skips those samples
when re-reading the WAL. Simulate this by creating a single mapped chunk
for each series, since the max time is all the reader looks at.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Fix comment

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Remove series map from processWALSamples()

The locks that the map was commented as reducing contention on are now
sharded 32,000 ways, so they won't be contended. Removing the map saves
memory and goes just as fast.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* loadWAL: Cache the last mmapped chunk time

So we can skip calling append() for samples it will reject.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Improvements from code review

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Full stops and capitals on comments

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Cache max time in both places mmappedChunks is updated

Including refactor to extract function `setMMappedChunks`, to reduce
code duplication.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Update head min/max time when mmapped chunks added

This ensures we have the correct values if no WAL samples are added for
that series.

Note that `mSeries.maxTime()` was always `math.MinInt64` before, since
that function doesn't consider mmapped chunks.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Split Go and React Tests (#8897)

* Added go-ci and react-ci

Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Remove search keymap from new expression editor (#9184)

Signed-off-by: Julius Volz <julius.volz@gmail.com>

Co-authored-by: Austin Cawley-Edwards <austin.cawley@gmail.com>
Co-authored-by: Levi Harrison <git@leviharrison.dev>
Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Co-authored-by: Bryan Boreham <bjboreham@gmail.com>
Co-authored-by: Julius Volz <julius.volz@gmail.com>


Development

Successfully merging this pull request may close these issues.

investigate per shard buffering improvements for remote write
