
Prevent Prometheus from overallocating memory on subquery with large amount of steps. #12734

Merged
bwplotka merged 9 commits into prometheus:main from alanprot:slice-size
Sep 25, 2023

Conversation

@alanprot (Contributor) commented Aug 22, 2023

Hi,

We just noticed that we can cause an OOM on Prometheus by running queries like avg_over_time(metric[30d:30s]): before they hit ErrTooManySamples, we allocate slices sized to the expected number of steps at the beginning of query evaluation.

See for instance:

prometheus/promql/engine.go

Lines 1695 to 1700 in c579144

	_, f, h, ok := ev.vectorSelectorSingle(it, e, ts)
	if ok {
		if ev.currentSamples < ev.maxSamples {
			if h == nil {
				if ss.Floats == nil {
					ss.Floats = getFPointSlice(numSteps)

This has two main consequences:

  • Prometheus may go OOM even if the query would eventually hit the ErrTooManySamples
  • Prometheus may use way more memory than needed to execute queries over sparse timeseries.

Ex:

Imagine that the query avg_over_time(metric[30d:30s]) matches 10K series but each series only has 3 days' worth of data. We also have a host with 50Gb of memory and query.max-samples set to 50M (the default limit) - with this config we would expect a single query to use at most ~6Gb (50M x ~~128 bytes~~ 128 bits - each FPoint has a size of ~~128 bytes~~ 128 bits).

Now for this query we have:

  • Total number of samples queried per series matched (samplesPerSeries):
    • 3d/30s = 8640
  • Total number of samples queried
    • 10K * 8640 = 86.4M samples
    • This is above the default limit (50M) so this query should fail with ErrTooManySamples
  • For each of the 10K series, we will allocate a slice of size 30d/30s = 86.4K (instead of the 8.6K needed)

In this scenario we will only hit ErrTooManySamples once we have already evaluated 50M/8640 ≈ 5.7K series (limit/samplesPerSeries). At this point we have already allocated 5.7K * 86.4K * ~~128 bytes~~ 128 bits ~= 58Gb, causing an OOM (way more than the expected 6Gb).
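The arithmetic above can be reproduced with a short sketch. Note it uses the corrected 16-byte FPoint size (128 bits, as the reviewer points out later in the thread), so the byte total comes out smaller than the Gb figures in the description, which were originally computed with 128 bytes per point:

```go
package main

import "fmt"

func main() {
	const (
		fPointSize       = 16               // bytes per FPoint (two 64-bit fields, i.e. 128 bits)
		maxSamples       = 50_000_000       // default query.max-samples
		rangeSteps       = 30 * 86_400 / 30 // 30d range / 30s step = 86.4K steps
		samplesPerSeries = 3 * 86_400 / 30  // only 3d of data per series = 8640 samples
	)

	// Series evaluated before the sample limit trips.
	seriesBeforeLimit := maxSamples / samplesPerSeries
	// Bytes pre-allocated by then, one full-range slice per series.
	allocatedBytes := seriesBeforeLimit * rangeSteps * fPointSize

	fmt.Println(10_000 * samplesPerSeries) // 86400000 samples total, above the 50M limit
	fmt.Println(seriesBeforeLimit)         // 5787
	fmt.Println(allocatedBytes)            // 7999948800, ~8Gb pre-allocated before ErrTooManySamples
}
```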

This can also lead to using way more memory than needed even when we don't hit ErrTooManySamples.

Proposed solution

This PR tries to pick a more sensible initial capacity for the points slices when we have too many steps or series, so that the engine can throw ErrTooManySamples before going OOM. To do so, instead of always allocating the slices with capacity equal to the number of steps, we calculate it with the following function:

func sizeToAllocate(numSteps, step, currentSamples, maxSamples, numberOfSeries int) int {
	// Subtract the current step from numSteps, as at this point we know the given series has no data before this step.
	remainingSteps := numSteps - step
	// Spread the remaining allowed samples across all the series in the result.
	allowedSamples := (maxSamples - currentSamples) / numberOfSeries
	if remainingSteps > allowedSamples {
		return allowedSamples
	}
	return remainingSteps
}
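Plugging the example's numbers into the proposed function shows the effect (the function body is reproduced from the PR description, wrapped in a runnable program):

```go
package main

import "fmt"

// sizeToAllocate is the capacity heuristic proposed in this PR's description.
func sizeToAllocate(numSteps, step, currentSamples, maxSamples, numberOfSeries int) int {
	// Remaining steps: this series has no data before the current step.
	remainingSteps := numSteps - step
	// Spread the remaining allowed samples across all series in the result.
	allowedSamples := (maxSamples - currentSamples) / numberOfSeries
	if remainingSteps > allowedSamples {
		return allowedSamples
	}
	return remainingSteps
}

func main() {
	// The scenario above: 86.4K steps, 10K series, 50M sample limit.
	fmt.Println(sizeToAllocate(86_400, 0, 0, 50_000_000, 10_000)) // 5000, not 86400

	// Far from the limit, the capacity is simply the remaining steps.
	fmt.Println(sizeToAllocate(1_000, 0, 0, 50_000_000, 10)) // 1000
}
```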

In the case above, instead of allocating 86.4K points per series we would allocate ~5K (50M/10K), which is less than the 8.6K needed but is a good starting point. It is also important to note that this should only make a difference when the query is relatively close to the query.max-samples limit - otherwise we return the same number of steps.

@yeya24 (Contributor) commented Aug 23, 2023

Hi team, hope we can get some 👀 on this. It is pretty easy to OOM-kill Prometheus with a subquery that has a very small step size, e.g. up[180d:1ms].
We cap the number of steps at 11K in the range query API, but we don't have a similar limit for subqueries. The samples slice allocation is a big problem: even if I only have 1h of data, a query with a long time range and a small subquery step allocates a large amount of memory.
cc @codesome as you are the original author of the subquery feature.

@bboreham (Member) left a comment

Thank you for this PR, it seems the basic idea is valid.

> each FPoint has size of 128 bytes

It's 128 bits, or 16 bytes.

Could you explain the role of biggestLen in the calculation?

promql/engine.go (outdated)

- for ts := ev.startTimestamp; ts <= ev.endTimestamp; ts += ev.interval {
+ for ts, step := ev.startTimestamp, -1; ts <= ev.endTimestamp; ts += ev.interval {
+ 	step++
Member:
I found 'step' quite confusing. Why is it initialized to -1 and then immediately incremented, rather than initialized to 0 and incremented in the third clause of the for?
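The variant the reviewer suggests could look like the following standalone sketch (the timestamp values here are hypothetical; the real loop iterates the evaluator's start/end timestamps):

```go
package main

import "fmt"

func main() {
	// Hypothetical stand-ins for ev.startTimestamp, ev.endTimestamp, ev.interval.
	start, end, interval := int64(0), int64(40), int64(10)

	// step starts at 0 and advances in the post statement together with ts,
	// instead of starting at -1 and being incremented at the top of the body.
	for ts, step := start, 0; ts <= end; ts, step = ts+interval, step+1 {
		fmt.Println(ts, step) // 0 0, 10 1, 20 2, 30 3, 40 4
	}
}
```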

Contributor Author:
Makes sense.. I will do it in the other places as well.

promql/engine.go (outdated)

  if sample.H == nil {
  	if ss.Floats == nil {
- 		ss.Floats = getFPointSlice(numSteps)
+ 		ss.Floats = getFPointSlice(pointsSliceSize(numSteps, step, ev.currentSamples, ev.maxSamples, biggestLen))
Member:
Seems ugly repeating this six times. Could the logic move inside getFPointSlice, and make it a member of evaluator so it has current and max samples available?
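One way the suggestion could look is sketched below. The types and the capacity heuristic here are simplified stand-ins for illustration, not the engine's real definitions (the real getFPointSlice draws from a pool, and the final heuristic was still under discussion):

```go
package main

import "fmt"

// FPoint and evaluator are minimal stand-ins for the engine types.
type FPoint struct {
	T int64
	F float64
}

type evaluator struct {
	currentSamples, maxSamples int
}

// getFPointSlice as an evaluator method: the capacity heuristic lives in one
// place and has ev.currentSamples and ev.maxSamples available, instead of
// being repeated at all six call sites.
func (ev *evaluator) getFPointSlice(numSteps int) []FPoint {
	size := numSteps
	if budget := ev.maxSamples - ev.currentSamples; budget < size {
		size = budget
	}
	if size < 0 {
		size = 0
	}
	return make([]FPoint, 0, size)
}

func main() {
	ev := &evaluator{currentSamples: 49_999_000, maxSamples: 50_000_000}
	fmt.Println(cap(ev.getFPointSlice(86_400))) // 1000: capped at the remaining sample budget
	fmt.Println(cap(ev.getFPointSlice(10)))     // 10: small requests are unchanged
}
```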

Contributor Author:
I will do that, but there are some cases where we fix the length to 16.
I will make the change and we can see if it's too ugly! hahah

floats = getFPointSlice(16)

@alanprot (Contributor, Author) commented:

> It's 128 bits, or 16 bytes.

Oops.. my bad! :D Fixed in the description!

> biggestLen

This seems to keep track of the matrix with the most time series.. that's why I used this value in this case. WDYT?

prometheus/promql/engine.go

Lines 1136 to 1144 in 8ef7dfd

	// Create an output vector that is as big as the input matrix with
	// the most time series.
	biggestLen := 1
	for i := range exprs {
		vectors[i] = make(Vector, 0, len(matrixes[i]))
		if len(matrixes[i]) > biggestLen {
			biggestLen = len(matrixes[i])
		}
	}

@bwplotka bwplotka changed the title Prometheus goes OOM if query has lots of steps before hitting ErrTooManySamples Prevent Prometheus from overallocating memory on subquery with large amount of steps. Aug 25, 2023
@bwplotka (Member) left a comment

Thanks! Good investigation 💪🏽

I would vote for a simpler heuristic for a start, but not a blocker. LGTM.

Do you mind adding/extending unit test?

promql/engine.go (outdated)

	// Subtract the current step from the numSteps as at this point we know the given series does not have data before this step
	remainingSteps := numSteps - step
	// spread the remaining allowed samples across all the series in the result.
	allowedSamples := (ev.maxSamples - ev.currentSamples) / numberOfSeries
Member:
I am not sure this complex heuristic is really needed. It guesses a lot, plus it requires the number of series, which blocks streaming approaches at some point (nit). I wonder if simply returning remainingSteps when it's smaller than (ev.maxSamples - ev.currentSamples) wouldn't be enough for a start?

@alanprot (Contributor, Author) commented Aug 25, 2023
Oh.. would this work?

The problem is that I think it would cause the same issue: the current samples count will be very low for sparse series, so we will end up allocating the whole number of steps for every series.

The crux of the problem is exactly that we create big slices while ev.currentSamples is still very low (as the series are sparse), right?
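The objection can be made concrete with a quick standalone sketch (hypothetical code, using the sparse-series numbers from the PR description): with the budget-only check, currentSamples grows by just 8640 per series, so the budget stays far above the step count and thousands of series still get full-size slices.

```go
package main

import "fmt"

// budgetOnly is the simpler heuristic from the review comment: return
// remainingSteps unless it exceeds the remaining sample budget.
func budgetOnly(remainingSteps, maxSamples, currentSamples int) int {
	if budget := maxSamples - currentSamples; remainingSteps > budget {
		return budget
	}
	return remainingSteps
}

func main() {
	// Sparse series: each contributes only 8640 samples (3d of data at 30s),
	// while the query spans 86.4K steps with a 50M sample limit.
	const steps, maxSamples, samplesPerSeries = 86_400, 50_000_000, 8_640

	current := 0
	allocatedPoints := 0
	for s := 0; s < 1_000; s++ { // first 1K of the 10K series
		allocatedPoints += budgetOnly(steps, maxSamples, current)
		current += samplesPerSeries
	}
	// The budget never dips below the step count here, so every series is
	// still allocated the full 86.4K points.
	fmt.Println(allocatedPoints) // 86400000
}
```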

Member:

Got it, sorry, I misunderstood the root cause in that suggestion. We can have a trillion steps, but if we have data for only a subset of those, we should still return the query just fine.

Still, in our main discussion we mentioned the logic of a min(5K, step) pre-alloc (or 10K). This could work fine for a start. If we are worried about the very sparse case, we can add logic to recompact the slice after we fill it with points, if really needed (although the question is how long this slice will live). Your idea of min(((ev.maxSamples - ev.currentSamples) / numberOfSeries), step) is not bad, it's just a pure guess and can still preallocate a lot. It's even a bit weird that the more series we process, the less we preallocate? Anyway, both are fine I guess, it's just that the simpler one is equally efficient in generic unknown environments?

Contributor Author:

> It's even a bit weird that the more series we process the less we preallocate

Yeah... that is the downside, but we have to keep in mind that if we are running far from ev.maxSamples this condition will never trigger... And if we are running close to it, we allocate less per series and so give the engine a chance to hit that limit before going OOM.

I think to prevent what we saw, we could either do min(5K, step) or divide by the number of series.

One thing I want to point out is that queries over sparse series are not uncommon if we are talking about 30 days' worth of data (if you deploy once every 2 days, you will have 2 days' worth of data on most of your series). And for those series we are always allocating slices to fit 30 days of steps even though we only have data for 2 days.

@alanprot (Contributor, Author) commented:

> I would vote for simpler heuristic for a start, but not a blocker. LGTM.

Do you have any idea? This one was the one I thought was simplest and would impact the current behavior least..

We could just limit the max number of points.. like.. never allocate more than 5K?

@bwplotka (Member) commented:

Yup 👍🏽

@alanprot (Contributor, Author) commented:

Ok.. I could do that! Hopefully we pick a max (5K or 10K) that will be a no-op in most cases.

@bboreham thoughts?

@bwplotka bwplotka requested a review from bboreham August 25, 2023 19:45
@alanprot (Contributor, Author) commented:

@bboreham PTAL?

@bboreham (Member) commented Sep 4, 2023

Agreed, let's try something simpler. If it's reallocating 5K-element slices, then the query is probably already doing a lot of work, so the reallocation is not the important part.

I looked into biggestLen - it is the input to the operation with the most series.
This is a poor guide to what we need on the output, which is what you are trying to change.
E.g. sum(foo) has N input series and 1 output series.
foo or bar has all the series from foo and bar, so more than biggestLen.

@alanprot (Contributor, Author) commented Sep 7, 2023

Ok.. I updated the PR with the simpler approach! :D
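The simpler capped pre-allocation discussed above might look like the sketch below. Note the cap of 5K is an assumption for illustration; the exact constant chosen in the merged change is not shown in this thread.

```go
package main

import "fmt"

// maxPointsSliceSize is an assumed cap (5K was one of the values discussed).
const maxPointsSliceSize = 5_000

// pointsSliceSize caps the initial slice capacity at a fixed maximum instead
// of always pre-allocating one point per step.
func pointsSliceSize(numSteps int) int {
	if numSteps > maxPointsSliceSize {
		return maxPointsSliceSize
	}
	return numSteps
}

func main() {
	fmt.Println(pointsSliceSize(86_400)) // 5000: huge subqueries no longer pre-allocate per step
	fmt.Println(pointsSliceSize(11))     // 11: ordinary queries are unaffected
}
```

This trades a possible slice regrowth on dense, long-range series for bounded up-front memory, which is the direction the reviewers converged on.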

alanprot and others added 8 commits September 7, 2023 16:59
Signed-off-by: Alan Protasio <alanprot@gmail.com>
…method to the evaluator

Signed-off-by: Alan Protasio <alanprot@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
@bboreham (Member) left a comment

LGTM, but might expand the comment a bit.

Signed-off-by: Alan Protasio <alanprot@gmail.com>
@bwplotka (Member) left a comment
👍🏽


4 participants