
Prevent Prometheus from overallocating memory on subquery with large amount of steps. #12734

Merged
bwplotka merged 9 commits into prometheus:main from alanprot:slice-size
Sep 25, 2023

Conversation

@alanprot (Contributor) commented Aug 22, 2023

Hi,

We just noticed that we can cause an OOM on Prometheus by running queries like avg_over_time(metric[30d:30s]): before they hit ErrTooManySamples, we allocate slices sized to the expected number of steps at the beginning of query evaluation.

See for instance:

prometheus/promql/engine.go

Lines 1695 to 1700 in c579144

	_, f, h, ok := ev.vectorSelectorSingle(it, e, ts)
	if ok {
		if ev.currentSamples < ev.maxSamples {
			if h == nil {
				if ss.Floats == nil {
					ss.Floats = getFPointSlice(numSteps)

This has two main consequences:

  • Prometheus may go OOM even if the query would eventually hit the ErrTooManySamples
  • Prometheus may use way more memory than needed to execute queries over sparse timeseries.

Ex:

Imagine that the query avg_over_time(metric[30d:30s]) matches 10K series but each series only has 3 days' worth of data. We also have a host with 50Gb of memory and query.max-samples set to 50M (the default limit) - with this config we would expect a single query to use at most ~6Gb (50M x ~~128 bytes~~ 128 bits - each FPoint has a size of ~~128 bytes~~ 128 bits).

Now for this query we have:

  • Total number of samples queried per series matched (samplesPerSeries):
    • 3d/30s = 8640
  • Total number of samples queried
    • 10K * 8640 = 86.4M samples
    • This is above the default limit (50M) so this query should fail with ErrTooManySamples
  • For each of the 10K series, we will allocate a slice of size 30d/30s = 86.4K (instead of the 8.6K needed)

In this scenario we will only hit ErrTooManySamples once we have already evaluated 50M/8640 ≈ 5.7K series (limit/samplesPerSeries). At this point we have already allocated 5.7K * 86.4K * ~~128 bytes~~ 128 bits ~= 58Gb, causing an OOM (way more than the expected 6Gb).
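The arithmetic above can be reproduced with a short sketch. Note it uses the corrected 16-byte FPoint size (128 bits, as the reviewer points out later in the thread), so the byte total comes out smaller than the Gb figures in the description, which were originally computed with 128 bytes per point:

```go
package main

import "fmt"

func main() {
	const (
		fPointSize       = 16               // bytes per FPoint (two 64-bit fields, i.e. 128 bits)
		maxSamples       = 50_000_000       // default query.max-samples
		rangeSteps       = 30 * 86_400 / 30 // 30d range / 30s step = 86.4K steps
		samplesPerSeries = 3 * 86_400 / 30  // only 3d of data per series = 8640 samples
	)

	// Series evaluated before the sample limit trips.
	seriesBeforeLimit := maxSamples / samplesPerSeries
	// Bytes pre-allocated by then, one full-range slice per series.
	allocatedBytes := seriesBeforeLimit * rangeSteps * fPointSize

	fmt.Println(10_000 * samplesPerSeries) // 86400000 samples total, above the 50M limit
	fmt.Println(seriesBeforeLimit)         // 5787
	fmt.Println(allocatedBytes)            // 7999948800, ~8Gb pre-allocated before ErrTooManySamples
}
```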

This can also lead to using way more memory than needed even when we don't hit ErrTooManySamples.

Proposed solution

This PR tries to pick a more sensible initial capacity for the points slices when we have too many steps or series, so that the engine can throw ErrTooManySamples before going OOM. To do so, instead of always allocating the slices with capacity equal to the number of steps, we calculate it with the following function:

func sizeToAllocate(numSteps, step, currentSamples, maxSamples, numberOfSeries int) int {
	// Subtract the current step from numSteps, as at this point we know the given series has no data before this step.
	remainingSteps := numSteps - step
	// Spread the remaining allowed samples across all the series in the result.
	allowedSamples := (maxSamples - currentSamples) / numberOfSeries
	if remainingSteps > allowedSamples {
		return allowedSamples
	}
	return remainingSteps
}
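Plugging the example's numbers into the proposed function shows the effect (the function body is reproduced from the PR description, wrapped in a runnable program):

```go
package main

import "fmt"

// sizeToAllocate is the capacity heuristic proposed in this PR's description.
func sizeToAllocate(numSteps, step, currentSamples, maxSamples, numberOfSeries int) int {
	// Remaining steps: this series has no data before the current step.
	remainingSteps := numSteps - step
	// Spread the remaining allowed samples across all series in the result.
	allowedSamples := (maxSamples - currentSamples) / numberOfSeries
	if remainingSteps > allowedSamples {
		return allowedSamples
	}
	return remainingSteps
}

func main() {
	// The scenario above: 86.4K steps, 10K series, 50M sample limit.
	fmt.Println(sizeToAllocate(86_400, 0, 0, 50_000_000, 10_000)) // 5000, not 86400

	// Far from the limit, the capacity is simply the remaining steps.
	fmt.Println(sizeToAllocate(1_000, 0, 0, 50_000_000, 10)) // 1000
}
```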

In the case above, instead of allocating 86.4K points per series we would allocate ~5K (50M/10K), which is less than the 8.6K needed but is a good starting point. It is also important to note that this should only make a difference when the query is relatively close to the query.max-samples limit - otherwise we return the same number of steps.

@yeya24 (Contributor) commented Aug 23, 2023

Hi team, hope we can get some 👀 on this. It is pretty easy to OOM-kill Prometheus with a subquery that has a very small step size, e.g. up[180d:1ms].
We cap the number of steps at 11K in the range query API, but we don't have a similar limit for subqueries. The samples slice allocation is a big problem: even if I only have 1h of data, a query with a long time range and a small subquery step allocates a large amount of memory.
cc @codesome as you are the original author of the subquery feature.

@bboreham (Member) left a comment

Thank you for this PR, it seems the basic idea is valid.

> each FPoint has size of 128 bytes

It's 128 bits, or 16 bytes.

Could you explain the role of biggestLen in the calculation?

promql/engine.go (outdated)

- for ts := ev.startTimestamp; ts <= ev.endTimestamp; ts += ev.interval {
+ for ts, step := ev.startTimestamp, -1; ts <= ev.endTimestamp; ts += ev.interval {
+ 	step++
Member:
I found 'step' quite confusing. Why is it initialized to -1 and then immediately incremented, rather than initialized to 0 and incremented in the third clause of the for?
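The variant the reviewer suggests could look like the following standalone sketch (the timestamp values here are hypothetical; the real loop iterates the evaluator's start/end timestamps):

```go
package main

import "fmt"

func main() {
	// Hypothetical stand-ins for ev.startTimestamp, ev.endTimestamp, ev.interval.
	start, end, interval := int64(0), int64(40), int64(10)

	// step starts at 0 and advances in the post statement together with ts,
	// instead of starting at -1 and being incremented at the top of the body.
	for ts, step := start, 0; ts <= end; ts, step = ts+interval, step+1 {
		fmt.Println(ts, step) // 0 0, 10 1, 20 2, 30 3, 40 4
	}
}
```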

Contributor Author:
Makes sense.. I will do it in the other places as well.

promql/engine.go (outdated)

  if sample.H == nil {
  	if ss.Floats == nil {
- 		ss.Floats = getFPointSlice(numSteps)
+ 		ss.Floats = getFPointSlice(pointsSliceSize(numSteps, step, ev.currentSamples, ev.maxSamples, biggestLen))
Member:
Seems ugly repeating this six times. Could the logic move inside getFPointSlice, and make it a member of evaluator so it has current and max samples available?
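One way the suggestion could look is sketched below. The types and the capacity heuristic here are simplified stand-ins for illustration, not the engine's real definitions (the real getFPointSlice draws from a pool, and the final heuristic was still under discussion):

```go
package main

import "fmt"

// FPoint and evaluator are minimal stand-ins for the engine types.
type FPoint struct {
	T int64
	F float64
}

type evaluator struct {
	currentSamples, maxSamples int
}

// getFPointSlice as an evaluator method: the capacity heuristic lives in one
// place and has ev.currentSamples and ev.maxSamples available, instead of
// being repeated at all six call sites.
func (ev *evaluator) getFPointSlice(numSteps int) []FPoint {
	size := numSteps
	if budget := ev.maxSamples - ev.currentSamples; budget < size {
		size = budget
	}
	if size < 0 {
		size = 0
	}
	return make([]FPoint, 0, size)
}

func main() {
	ev := &evaluator{currentSamples: 49_999_000, maxSamples: 50_000_000}
	fmt.Println(cap(ev.getFPointSlice(86_400))) // 1000: capped at the remaining sample budget
	fmt.Println(cap(ev.getFPointSlice(10)))     // 10: small requests are unchanged
}
```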

Contributor Author:
I will do that, but there are some cases where we fix the length to 16.
I will make the change and we can see if it's too ugly! hahah

floats = getFPointSlice(16)

@alanprot (Contributor, Author) commented:

> It's 128 bits, or 16 bytes.

Oops.. my bad! :D Fixed in the description!

> biggestLen

This seems to keep track of the matrix with the most time series.. that's why I used this value in this case. WDYT?

prometheus/promql/engine.go

Lines 1136 to 1144 in 8ef7dfd

	// Create an output vector that is as big as the input matrix with
	// the most time series.
	biggestLen := 1
	for i := range exprs {
		vectors[i] = make(Vector, 0, len(matrixes[i]))
		if len(matrixes[i]) > biggestLen {
			biggestLen = len(matrixes[i])
		}
	}

@bwplotka bwplotka changed the title Prometheus goes OOM if query has lots of steps before hitting ErrTooManySamples Prevent Prometheus from overallocating memory on subquery with large amount of steps. Aug 25, 2023
@bwplotka (Member) left a comment

Thanks! Good investigation 💪🏽

I would vote for a simpler heuristic for a start, but not a blocker. LGTM.

Do you mind adding/extending unit test?

promql/engine.go (outdated)

	// Subtract the current step from the numSteps as at this point we know the given series does not have data before this step
	remainingSteps := numSteps - step
	// spread the remaining allowed samples across all the series in the result.
	allowedSamples := (ev.maxSamples - ev.currentSamples) / numberOfSeries
Member:
I am not sure this complex heuristic is really needed. It guesses a lot, plus it requires the number of series, which blocks streaming approaches at some point (nit). I wonder if simply returning remainingSteps when it's smaller than (ev.maxSamples - ev.currentSamples) wouldn't be enough for a start?

@alanprot (Contributor, Author) commented Aug 25, 2023
Oh.. would this work?

The problem is that I think it would cause the same issue: the current samples count will be very low for sparse series, so we will end up allocating the whole number of steps for every series.

The crux of the problem is exactly that we create big slices while ev.currentSamples is still very low (as the series are sparse), right?
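The objection can be made concrete with a quick standalone sketch (hypothetical code, using the sparse-series numbers from the PR description): with the budget-only check, currentSamples grows by just 8640 per series, so the budget stays far above the step count and thousands of series still get full-size slices.

```go
package main

import "fmt"

// budgetOnly is the simpler heuristic from the review comment: return
// remainingSteps unless it exceeds the remaining sample budget.
func budgetOnly(remainingSteps, maxSamples, currentSamples int) int {
	if budget := maxSamples - currentSamples; remainingSteps > budget {
		return budget
	}
	return remainingSteps
}

func main() {
	// Sparse series: each contributes only 8640 samples (3d of data at 30s),
	// while the query spans 86.4K steps with a 50M sample limit.
	const steps, maxSamples, samplesPerSeries = 86_400, 50_000_000, 8_640

	current := 0
	allocatedPoints := 0
	for s := 0; s < 1_000; s++ { // first 1K of the 10K series
		allocatedPoints += budgetOnly(steps, maxSamples, current)
		current += samplesPerSeries
	}
	// The budget never dips below the step count here, so every series is
	// still allocated the full 86.4K points.
	fmt.Println(allocatedPoints) // 86400000
}
```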

Member:

Got it, sorry, I misunderstood the root cause in that suggestion. We can have a trillion steps, but if we have data for only a subset of those, we should still return the query just fine.

Still, in our main discussion we mentioned the logic of a min(5K, step) pre-alloc (or 10K). This could work fine for a start. If we are worried about the very sparse case, we can add logic to recompact the slice after we fill it with points, if really needed (although the question is how long this slice will live). Your idea of min(((ev.maxSamples - ev.currentSamples) / numberOfSeries), step) is not bad, it's just a pure guess and can still preallocate a lot. It's even a bit weird that the more series we process, the less we preallocate? Anyway, both are fine I guess, it's just that the simpler one is equally efficient in generic unknown environments?

Contributor Author:

> It's even a bit weird that the more series we process the less we preallocate

Yeah... that is the downside, but we have to keep in mind that if we are running far from ev.maxSamples this condition will never trigger... And if we are running close to it, we allocate less per series and so give the engine a chance to hit that limit before going OOM.

I think to prevent what we saw, we could either do min(5K, step) or divide by the number of series.

One thing I want to point out is that queries over sparse series are not uncommon if we are talking about 30 days' worth of data (if you deploy once every 2 days, you will have 2 days' worth of data on most of your series). And for those series we are always allocating slices to fit 30 days of steps even though we only have data for 2 days.

@alanprot (Contributor, Author) commented:

> I would vote for simpler heuristic for a start, but not a blocker. LGTM.

Do you have any idea? This one was the one I thought was simplest and would impact the current behavior least..

We could just limit the max number of points.. like.. never allocate more than 5K?

@bwplotka (Member) commented:

Yup 👍🏽

@alanprot (Contributor, Author) commented:

Ok.. I could do that! Hopefully we pick a max (5K or 10K) that will be a no-op in most cases.

@bboreham thoughts?

@bwplotka bwplotka requested a review from bboreham August 25, 2023 19:45
@alanprot (Contributor, Author) commented:

@bboreham PTAL?

@bboreham (Member) commented Sep 4, 2023

Agreed, let's try something simpler. If it's reallocating 5K-element slices, then the query is probably already doing a lot of work, so the reallocation is not the important part.

I looked into biggestLen - it is the input to the operation with the most series.
This is a poor guide to what we need on the output, which is what you are trying to change.
E.g. sum(foo) has N input series and 1 output series.
foo or bar has all the series from foo and bar, so more than biggestLen.

@alanprot (Contributor, Author) commented Sep 7, 2023

Ok.. I updated the PR with the simpler approach! :D
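The simpler capped pre-allocation discussed above might look like the sketch below. Note the cap of 5K is an assumption for illustration; the exact constant chosen in the merged change is not shown in this thread.

```go
package main

import "fmt"

// maxPointsSliceSize is an assumed cap (5K was one of the values discussed).
const maxPointsSliceSize = 5_000

// pointsSliceSize caps the initial slice capacity at a fixed maximum instead
// of always pre-allocating one point per step.
func pointsSliceSize(numSteps int) int {
	if numSteps > maxPointsSliceSize {
		return maxPointsSliceSize
	}
	return numSteps
}

func main() {
	fmt.Println(pointsSliceSize(86_400)) // 5000: huge subqueries no longer pre-allocate per step
	fmt.Println(pointsSliceSize(11))     // 11: ordinary queries are unaffected
}
```

This trades a possible slice regrowth on dense, long-range series for bounded up-front memory, which is the direction the reviewers converged on.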

alanprot and others added 8 commits September 7, 2023 16:59
Signed-off-by: Alan Protasio <alanprot@gmail.com>
…method to the evaluator

Signed-off-by: Alan Protasio <alanprot@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
@bboreham (Member) left a comment

LGTM, but might expand the comment a bit.

Signed-off-by: Alan Protasio <alanprot@gmail.com>
@bwplotka (Member) left a comment
👍🏽


4 participants