
Remove Individual Data Type Caps in Per-shard Buffering for Remote Write#8921

Merged
csmarchbanks merged 4 commits into prometheus:main from LeviHarrison:exemplar-shard-buffer
Aug 9, 2021

Conversation

@LeviHarrison (Contributor) commented Jun 11, 2021

This PR consolidates the sampleBuffer and exemplarBuffer buffers into the single pendingData buffer, removing the individual cap for each data type in favor of one cap for all data points combined.

Fixes #8788

@roidelapluie (Member)

Thanks for this. @cstyan would you have some time to review it?

@LeviHarrison did you run any benchmarks on this?

@LeviHarrison (Contributor, Author)

Not yet, I probably should.

@cstyan (Member) commented Jun 30, 2021

I should be able to review next week, end of this week is a holiday here. Benchmarks with memory stats would be good :)

@LeviHarrison (Contributor, Author)

I set up a benchmark using Grafana's TNS demo, so plenty of exemplars. I took this screenshot after it had been running for about 10 minutes.

[Screenshot: test-dashboard]

@LeviHarrison (Contributor, Author)

After about 20 minutes:

[Screenshot: benchmark-2]

@LeviHarrison (Contributor, Author)

Speaking of benchmarking... prometheus/test-infra#249

@cstyan (Member) left a comment

Thanks for taking this on @LeviHarrison, looking good so far. The benchmark screenshots look good if we can see those results consistently: the same allocations and overall memory usage, but better pending-samples throughput.

It would be nice if we could benchmark this in a more automated fashion, especially for any more potential future work. Maybe BenchmarkSampleDelivery could be extended to also send exemplars?

Dave Cheney's post here has some good content on benchmarks: https://dave.cheney.net/high-performance-go-workshop/gophercon-2019.html#profiling_benchmarks
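
As a rough illustration of what an extended benchmark could look like, here is a minimal, self-contained sketch. The type and function names (`timeSeries`, `buildPending`, `benchmarkDelivery`) are hypothetical stand-ins, not the real `BenchmarkSampleDelivery` or `prompb` types; it only shows the shape of a benchmark that exercises both samples and exemplars in a shared per-shard buffer.

```go
package main

import (
	"fmt"
	"testing"
)

// Hypothetical stand-ins for prompb.Sample / prompb.Exemplar.
type sample struct{ ts int64; v float64 }
type exemplar struct{ ts int64; v float64 }

type timeSeries struct {
	samples   []sample
	exemplars []exemplar
}

// buildPending mimics one fill pass over the shared pending buffer:
// every other data point is an exemplar, so both fields get exercised.
func buildPending(pending []timeSeries) {
	for i := range pending {
		pending[i].samples = pending[i].samples[:0]
		pending[i].exemplars = pending[i].exemplars[:0]
		if i%2 == 0 {
			pending[i].samples = append(pending[i].samples, sample{int64(i), 1})
		} else {
			pending[i].exemplars = append(pending[i].exemplars, exemplar{int64(i), 1})
		}
	}
}

func benchmarkDelivery(b *testing.B) {
	// Pre-allocate len=0, cap=1 slices once, outside the timed loop.
	pending := make([]timeSeries, 1000)
	for i := range pending {
		pending[i].samples = make([]sample, 0, 1)
		pending[i].exemplars = make([]exemplar, 0, 1)
	}
	b.ReportAllocs()
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		buildPending(pending)
	}
}

func main() {
	// testing.Benchmark lets us run the benchmark outside `go test`.
	res := testing.Benchmark(benchmarkDelivery)
	fmt.Printf("allocs/op: %d\n", res.AllocsPerOp())
}
```

With the buffers pre-allocated, the fill pass itself should report zero allocations per op, which is the property the review discussion below circles around.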

@LeviHarrison (Contributor, Author)

// TODO(cstyan): Casting this many times smells, also we could get index out of bounds issues here.

I removed this comment from where maxExemplars was assigned; is it still valid?

@LeviHarrison (Contributor, Author)

@cstyan I'll take a look at expanding the benchmarks. I think that will probably end up as a separate PR though.

@csmarchbanks (Member) left a comment

Generally I think this is close, but I left one performance concern and, I think, a solution for it.

sampleBuffer[nPendingSamples][0] = d.sample
pendingData[nPending].Labels = labelsToLabelsProto(d.seriesLabels, pendingData[nPending].Labels)
pendingData[nPending].Samples = sampleBuffer[nPendingSamples]
pendingData[nPending].Samples = []prompb.Sample{d.sample}
@csmarchbanks (Member) commented Jul 20, 2021

By creating a new slice of samples I fear that we are reducing some of the allocation optimizations made in #5849. I wonder if we could still pre-allocate samples and exemplars with len=0 and cap=1, then either set the value or truncate the slice as we use them?

Here is a benchmark you can use to see these concerns:

name               old time/op    new time/op    delta
SampleDelivery-16     15.5s ± 2%     16.9s ± 1%   +8.72%  (p=0.001 n=7+6)

name               old alloc/op   new alloc/op   delta
SampleDelivery-16    20.5GB ± 0%    21.7GB ± 0%   +5.85%  (p=0.001 n=7+7)

name               old allocs/op  new allocs/op  delta
SampleDelivery-16      176M ± 0%      201M ± 0%  +14.20%  (p=0.001 n=7+7)

The screenshots you posted don't show much of a difference, so it is possible that this micro benchmark is no longer valid, but it would be good to see for sure as at one point it was useful for me.

@LeviHarrison (Contributor, Author)

I think what you're saying is that when we allocate pendingData we should also allocate len=0 sample and exemplar slices to pre-populate the fields. The concern there was that we'd over-allocate exemplars, but I think that if an allocation now happens for every single new sample, the initial penalty is well worth it.

@csmarchbanks (Member)

Yep, if someone has send_exemplars turned on, chances are they have some exemplars, so eventually all of those would be allocated anyway, and it is much better than creating garbage that needs to be cleaned up for every sample.

@LeviHarrison (Contributor, Author) commented Jul 24, 2021

Actually, after trying to implement this, I don't think it will work. Before, when there were two separate slices that each held only one type (sample or exemplar), they would get flushed up to the index they had been filled to, eliminating the possibility of a data point from the last cycle being sent again. But now that each entry can hold either a sample or an exemplar, a data point from the last cycle won't get wiped out if it isn't the same type as the new one. The only way no data points get re-flushed is if the buffer is filled in the exact same order every single time, which isn't guaranteed.
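
The hazard described above can be shown concretely. This is a minimal sketch with hypothetical types (plain float64 slices standing in for the real sample/exemplar protos): if a reused entry is not truncated each cycle, a stale sample from the previous cycle survives when the new data point is of the other type.

```go
package main

import "fmt"

type timeSeries struct {
	samples   []float64
	exemplars []float64
}

// fillNoTruncate overwrites only the field matching the new data point,
// leaving the other field exactly as it was last cycle.
func fillNoTruncate(ts *timeSeries, isExemplar bool, v float64) {
	if isExemplar {
		ts.exemplars = []float64{v}
	} else {
		ts.samples = []float64{v}
	}
}

// fillTruncate resets both fields first, so only this cycle's data survives.
func fillTruncate(ts *timeSeries, isExemplar bool, v float64) {
	ts.samples = ts.samples[:0]
	ts.exemplars = ts.exemplars[:0]
	if isExemplar {
		ts.exemplars = append(ts.exemplars, v)
	} else {
		ts.samples = append(ts.samples, v)
	}
}

func main() {
	var ts timeSeries
	fillNoTruncate(&ts, false, 1) // cycle 1: a sample
	fillNoTruncate(&ts, true, 2)  // cycle 2: only an exemplar
	fmt.Println(len(ts.samples))  // stale sample still present: 1

	var ts2 timeSeries
	fillTruncate(&ts2, false, 1)
	fillTruncate(&ts2, true, 2)
	fmt.Println(len(ts2.samples)) // 0: stale sample wiped
}
```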

@csmarchbanks (Member)

Hmm, yeah this may add a fair bit more complexity. My thought is something like:

pendingData[nPending].Samples = pendingData[nPending].Samples[:0]
pendingData[nPending].Exemplars = pendingData[nPending].Exemplars[:0]
switch d := sample.(type) {
case writeSample:
    pendingData[nPending].Samples = append(pendingData[nPending].Samples, d.sample)
case writeExemplar:
    pendingData[nPending].Exemplars = append(pendingData[nPending].Exemplars, d.exemplar)
}

Do you think that would work? You might be able to use some ifs to avoid the truncations of the slices, but I don't know if it would help with performance at all.
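
For reference, here is a self-contained, runnable version of that truncate-then-append pattern. All the types here (`promSample`, `promExemplar`, `writeSample`, `writeExemplar`, `timeSeries`) are mock stand-ins for the real `prompb` and queue-manager types, just to show the mechanics.

```go
package main

import "fmt"

// Mock stand-ins for prompb.Sample / prompb.Exemplar and the queue's
// interleaved data-point wrappers (hypothetical names).
type promSample struct{ v float64 }
type promExemplar struct{ v float64 }
type writeSample struct{ sample promSample }
type writeExemplar struct{ exemplar promExemplar }

type timeSeries struct {
	Samples   []promSample
	Exemplars []promExemplar
}

// populate applies the truncate-then-append pattern: both slices are reset
// to length zero (keeping capacity), then only the matching one is filled,
// so nothing from a previous cycle can leak through.
func populate(pendingData []timeSeries, nPending int, point interface{}) {
	pendingData[nPending].Samples = pendingData[nPending].Samples[:0]
	pendingData[nPending].Exemplars = pendingData[nPending].Exemplars[:0]
	switch d := point.(type) {
	case writeSample:
		pendingData[nPending].Samples = append(pendingData[nPending].Samples, d.sample)
	case writeExemplar:
		pendingData[nPending].Exemplars = append(pendingData[nPending].Exemplars, d.exemplar)
	}
}

func main() {
	pending := make([]timeSeries, 2)
	for i := range pending {
		// Pre-allocate with len=0, cap=1 so append usually won't allocate.
		pending[i].Samples = make([]promSample, 0, 1)
		pending[i].Exemplars = make([]promExemplar, 0, 1)
	}
	populate(pending, 0, writeSample{promSample{1.5}})
	populate(pending, 1, writeExemplar{promExemplar{2.5}})
	fmt.Println(len(pending[0].Samples), len(pending[1].Exemplars)) // 1 1
}
```

The truncation `s[:0]` keeps the backing array, so repeated cycles reuse the same memory rather than generating garbage per data point.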

@LeviHarrison (Contributor, Author) commented Jul 26, 2021

I'm not sure, let's give it a try. Just pushed the change; tests pass. I'm not sure which benchmark you were using, otherwise I would run it myself.

@csmarchbanks (Member)

I have been running:

go test -bench=BenchmarkSampleDelivery -benchmem -run=djfkasldf ./storage/remote

for the benchmark. Here are the results for before this change, without preallocation, and with preallocation:

benchstat before.txt no_preallocation.txt preallocation.txt
name \ time/op     before.txt   no_preallocation.txt  preallocation.txt
SampleDelivery-16   15.5s ± 2%            16.9s ± 1%         15.1s ± 1%

name \ alloc/op    before.txt   no_preallocation.txt  preallocation.txt
SampleDelivery-16  20.5GB ± 0%           21.7GB ± 0%        20.5GB ± 0%

name \ allocs/op   before.txt   no_preallocation.txt  preallocation.txt
SampleDelivery-16    176M ± 0%             201M ± 0%          176M ± 0%

So it looks like we are back to the same number of allocations now (yay!) and might even be a couple percent faster compared to before this PR.

@csmarchbanks (Member) left a comment

👍

var pendingData = make([]prompb.TimeSeries, max)
for i := range pendingData {
    pendingData[i].Samples = []prompb.Sample{{}}
    pendingData[i].Exemplars = []prompb.Exemplar{{}}
}
@csmarchbanks (Member) commented Jul 27, 2021

I just realized: I guess we actually only need to allocate this if s.qm.sendExemplars? We would probably also have to protect line 1074 with an if as well at that point. Seems like it wouldn't make much of a difference in practice so 🤷.
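
A sketch of what gating that allocation might look like. The constructor and field names here are hypothetical (the real code inlines this in the shard loop and uses `prompb` types); it just shows the exemplar slices staying nil when exemplar sending is off.

```go
package main

import "fmt"

// Mock stand-ins for the prompb types (hypothetical names).
type promSample struct{ v float64 }
type promExemplar struct{ v float64 }

type timeSeries struct {
	Samples   []promSample
	Exemplars []promExemplar
}

// newPendingData pre-allocates sample slices always, but exemplar slices
// only when sendExemplars is set, as suggested in the comment above.
func newPendingData(max int, sendExemplars bool) []timeSeries {
	pendingData := make([]timeSeries, max)
	for i := range pendingData {
		pendingData[i].Samples = make([]promSample, 0, 1)
		if sendExemplars {
			pendingData[i].Exemplars = make([]promExemplar, 0, 1)
		}
	}
	return pendingData
}

func main() {
	pd := newPendingData(3, false)
	// Samples are ready to append into; exemplar slices were never allocated.
	fmt.Println(cap(pd[0].Samples), pd[0].Exemplars == nil) // 1 true
}
```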

@LeviHarrison (Contributor, Author)

True

@LeviHarrison (Contributor, Author)

We could also go with the approach of:

var reset func()
if !s.qm.sendExemplars {
  reset = func() {
    pendingData[nPending].Samples = pendingData[nPending].Samples[:0]
  }
} else {
  reset = func() {
    pendingData[nPending].Samples = pendingData[nPending].Samples[:0]
    pendingData[nPending].Exemplars = pendingData[nPending].Exemplars[:0]
  }
}

It might be cleaner, maybe even better for performance, IDK.

Signed-off-by: Levi Harrison <git@leviharrison.dev>
@LeviHarrison force-pushed the exemplar-shard-buffer branch from 909ba1c to 38edd67 on July 27, 2021 16:42
@csmarchbanks (Member) left a comment

Approving the small new changes as well. @cstyan, any other concerns on this one? Otherwise let's merge it.

@cstyan (Member) commented Aug 9, 2021

👍 lgtm

@csmarchbanks csmarchbanks merged commit fac1b57 into prometheus:main Aug 9, 2021
@LeviHarrison (Contributor, Author)

Thanks for all your time reviewing!

codesome added a commit that referenced this pull request Aug 11, 2021
* Fix `kuma_sd` targetgroup reporting (#9157)

* Bundle all xDS targets into a single group

Signed-off-by: austin ce <austin.cawley@gmail.com>

* Snapshot in-memory chunks on shutdown for faster restarts (#7229)

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Rename links

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Remove Individual Data Type Caps in Per-shard Buffering for Remote Write (#8921)

* Moved everything to nPending buffer

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Simplify exemplar capacity addition

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Added pre-allocation

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Don't allocate if not sending exemplars

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Avoid deadlock when processing duplicate series record (#9170)

* Avoid deadlock when processing duplicate series record

`processWALSamples()` needs to be able to send on its output channel
before it can read the input channel, so the output channel is also read
to allow this, in case it is full.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* processWALSamples: update comment

Previous text seems to relate to an earlier implementation.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Optimise WAL loading by removing extra map and caching min-time (#9160)

* BenchmarkLoadWAL: close WAL after use

So that goroutines are stopped and resources released

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* BenchmarkLoadWAL: make series IDs co-prime with #workers

Series are distributed across workers by taking the modulus of the
ID with the number of workers, so multiples of 100 are a poor choice.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* BenchmarkLoadWAL: simulate mmapped chunks

Real Prometheus cuts chunks every 120 samples, then skips those samples
when re-reading the WAL. Simulate this by creating a single mapped chunk
for each series, since the max time is all the reader looks at.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Fix comment

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Remove series map from processWALSamples()

The locks that the map was commented as reducing contention on are now
sharded 32,000 ways, so they won't be contended. Removing the map saves
memory and goes just as fast.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* loadWAL: Cache the last mmapped chunk time

So we can skip calling append() for samples it will reject.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Improvements from code review

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Full stops and capitals on comments

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Cache max time in both places mmappedChunks is updated

Including refactor to extract function `setMMappedChunks`, to reduce
code duplication.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Update head min/max time when mmapped chunks added

This ensures we have the correct values if no WAL samples are added for
that series.

Note that `mSeries.maxTime()` was always `math.MinInt64` before, since
that function doesn't consider mmapped chunks.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Split Go and React Tests (#8897)

* Added go-ci and react-ci

Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Remove search keymap from new expression editor (#9184)

Signed-off-by: Julius Volz <julius.volz@gmail.com>

Co-authored-by: Austin Cawley-Edwards <austin.cawley@gmail.com>
Co-authored-by: Levi Harrison <git@leviharrison.dev>
Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Co-authored-by: Bryan Boreham <bjboreham@gmail.com>
Co-authored-by: Julius Volz <julius.volz@gmail.com>


Development

Successfully merging this pull request may close these issues.

investigate per shard buffering improvements for remote write
