Prevent Prometheus from overallocating memory on subquery with large amount of steps. #12734
bwplotka merged 9 commits into prometheus:main
Conversation
Hi team, hope we can get some 👀 on this. It is pretty easy to OOM kill Prometheus with a subquery with a very small step size.
promql/engine.go
Outdated
```diff
-	for ts := ev.startTimestamp; ts <= ev.endTimestamp; ts += ev.interval {
+	for ts, step := ev.startTimestamp, -1; ts <= ev.endTimestamp; ts += ev.interval {
+		step++
```
I found `step` quite confusing. Why is it initialized to -1 and then immediately incremented, rather than initialized to 0 and incremented in the third part of the for?
Makes sense.. I will do it in the other places as well.
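For illustration, the reviewer's suggestion boils down to the loop shape below (a self-contained sketch: `countSteps` and its plain parameters stand in for the evaluator fields such as `ev.startTimestamp`, and are not names from the PR):

```go
package main

import "fmt"

// countSteps walks timestamps the way the reviewer suggested: step starts
// at 0 and is advanced in the for post statement, instead of starting at
// -1 and being incremented at the top of the loop body.
func countSteps(start, end, interval int64) int {
	steps := 0
	for ts, step := start, 0; ts <= end; ts, step = ts+interval, step+1 {
		_ = ts // a real evaluator would evaluate the expression at ts here
		steps = step + 1
	}
	return steps
}

func main() {
	// 0ms..10000ms inclusive at a 1000ms interval -> 11 steps.
	fmt.Println(countSteps(0, 10_000, 1_000))
}
```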
promql/engine.go
Outdated
```diff
 if sample.H == nil {
 	if ss.Floats == nil {
-		ss.Floats = getFPointSlice(numSteps)
+		ss.Floats = getFPointSlice(pointsSliceSize(numSteps, step, ev.currentSamples, ev.maxSamples, biggestLen))
```
It seems ugly repeating this six times. Could the logic move inside `getFPointSlice`, making it a member of `evaluator` so it has current and max samples available?
I will do that, but there are some cases where we fix the length to 16.
I will make the change and we can see if it's too ugly! hahah
Line 2089 in 8ef7dfd
Ops.. my bad! :D Fixed in the description!
This seems to keep track of the matrix with the most time series.. that's why I used this value in this case. WDYT? Lines 1136 to 1144 in 8ef7dfd
ErrTooManySamples
bwplotka
left a comment
Thanks! Good investigation 💪🏽
I would vote for simpler heuristic for a start, but not a blocker. LGTM.
Do you mind adding/extending a unit test?
promql/engine.go
Outdated
```diff
+// Subtract the current step from numSteps, as at this point we know the given series does not have data before this step.
+remainingSteps := numSteps - step
+// Spread the remaining allowed samples across all the series in the result.
+allowedSamples := (ev.maxSamples - ev.currentSamples) / numberOfSeries
```
I am not sure this complex heuristic is really needed. It guesses a lot, plus it requires the number of series, which blocks streaming approaches at some point (nit). I wonder if simply returning `remainingSteps` when it's smaller than `ev.maxSamples - ev.currentSamples` is not enough for a start?
oh.. would this work?
The problem here is I think it would cause the same problem.. because the current samples will be very low for sparse series, and we would end up allocating the whole steps for all the series.
The crux of the problem is exactly that we create big slices while `ev.currentSamples` is very low (as the series are sparse), right?
Got it, sorry, I misunderstood the root cause in that suggestion. We can have a trillion steps, but if we have data for only a subset of those, we should still return the query just fine.
Still, in our main discussion we mentioned the logic of a min(5K, steps) pre-alloc (or 10K). This could work fine for a start. If we are worried about the very sparse case, we can add slice re-compaction logic after we fill it with points if really needed (although the question is how long this slice will live). Your idea of min((ev.maxSamples - ev.currentSamples) / numberOfSeries, steps) is not bad, it's just a pure guess and can still preallocate a lot. It's even a bit weird that the more series we process, the less we preallocate? Anyway, both are fine I guess, it's just that the simpler one is equally efficient in generic unknown environments?
It's even a bit weird that the more series we process the less we preallocate
Yeah... that is the downside, but we have to keep in mind that if we are running far from ev.maxSamples this condition will never trigger... And if we are running close to it, we allocate less per series and so get a chance to hit that limit before going OOM.
I think to prevent what we saw we could either do min(5K, steps) or we need to divide by the number of series.
One thing I want to point out is that queries over sparse series are not uncommon in this case if we are talking about 30 days worth of data (if you deploy once every 2 days, you will have only 2 days worth of data on most of your series). And for those series we are always allocating slices to fit 30 days of steps even though we only have data for 2 days.
We could just limit the max number of points.. like.. never allocate more than 5K?
Yup 👍🏽
Ok.. I could do that! Hopefully we pick a max (5K or 10K) that will be a no-op in most cases. @bboreham thoughts?
@bboreham PTAL?
Agreed, let's try something simpler. If it's reallocating 5K slices then the query is probably already doing a lot of work, so reallocation is not the important part. I looked in to
Ok.. I updated the PR with the simpler approach! :D
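The "simpler approach" discussed above amounts to capping the pre-allocation at a fixed maximum instead of dividing by the number of series. A self-contained sketch of that idea (the 5000 constant, the pool usage, and the exact function shape are assumptions modeled on the discussion, not necessarily the merged code):

```go
package main

import (
	"fmt"
	"sync"
)

// FPoint mirrors PromQL's float sample point: a timestamp plus a value,
// 128 bits per point.
type FPoint struct {
	T int64
	F float64
}

// Cap modeled on the min(5K, steps) idea from the discussion above.
const maxPointsSliceSize = 5000

var fPointPool sync.Pool

// getFPointSlice returns a pooled slice if one is available; otherwise it
// allocates a slice whose capacity is the requested size capped at
// maxPointsSliceSize. In the rare case a series really has more points,
// append simply grows the slice past the cap.
func getFPointSlice(sz int) []FPoint {
	if p := fPointPool.Get(); p != nil {
		return p.([]FPoint)
	}
	if sz > maxPointsSliceSize {
		sz = maxPointsSliceSize
	}
	return make([]FPoint, 0, sz)
}

func main() {
	fmt.Println(cap(getFPointSlice(86_400))) // huge step counts are capped at 5000
	fmt.Println(cap(getFPointSlice(100)))    // small requests are unchanged: 100
}
```

With this, a `[30d:30s]` subquery no longer pre-allocates 86.4K points per series up front, while queries whose step count is already below the cap see no change at all.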
Signed-off-by: Alan Protasio <alanprot@gmail.com>
…method to the evaluator Signed-off-by: Alan Protasio <alanprot@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com> Signed-off-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com> Signed-off-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com> Signed-off-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com> Signed-off-by: Alan Protasio <alanprot@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
bboreham
left a comment
There was a problem hiding this comment.
LGTM, but you might expand the comment a bit.
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Hi,
We just noticed that we can cause an OOM on Prometheus, before it hits `ErrTooManySamples`, if we run queries like `avg_over_time(metric[30d:30s])`, because we allocate slices with the expected number of steps at the beginning of the query evaluation. See for instance:
prometheus/promql/engine.go
Lines 1695 to 1700 in c579144
This has 2 main consequences:
- We can go OOM before hitting `ErrTooManySamples`

Ex:
Imagine that the query `avg_over_time(metric[30d:30s])` matches 10K series but each series only has 3 days worth of data. We also have a host with 50Gb and `query.max-samples` set to 50M (the default limit) - with this config we would expect a single query to use at most ~= 6Gb (50M x 128 bits - each `FPoint` has a size of 128 bits).
Now for this query we have:
- numSteps = `30d/30s` = 86.4K (instead of the 8.6K needed)

In this scenario we will only hit `ErrTooManySamples` when we have already evaluated 50M/8640 = 5.7K series (limit/samplesPerSeries). At this point we have already allocated 5.7K * 86.4K * 128 bits ~= 58Gb, causing OOM (way more than the expected 6Gb).
This can also lead to using way more memory than needed even if we don't hit `ErrTooManySamples`.
Proposed solution
This PR tries to pick a more sensible initial capacity for the points slices in cases where we have too many steps or series, in order to allow the engine to throw `ErrTooManySamples` before going OOM. To do so, instead of always allocating the slices with capacity equal to the number of steps, we calculate the capacity with the following function:
In the case before, instead of allocating 86.4K points per series we would be allocating ~5K (50M/10K), which is less than the 8.6K needed but is a good starting point. It is also important to note that this should only make a difference if the query is relatively close to the `query.max-samples` limit - otherwise we will return the same number of steps.
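The capacity function referenced in the description did not survive extraction. The following sketch is consistent with the description's numbers, spreading the remaining allowed samples across the series in the result; `pointsSliceSize` and its parameter names are illustrative, not the exact code from the PR:

```go
package main

import "fmt"

// pointsSliceSize sketches the heuristic described above: never
// pre-allocate more per series than the remaining sample budget divided
// by the number of series in the result, and never more than the number
// of steps itself.
func pointsSliceSize(numSteps, currentSamples, maxSamples, numberOfSeries int) int {
	allowedSamples := (maxSamples - currentSamples) / numberOfSeries
	if allowedSamples < numSteps {
		return allowedSamples
	}
	return numSteps
}

func main() {
	// The example from the description: 86.4K steps, 10K series,
	// query.max-samples=50M -> ~5K points pre-allocated per series.
	fmt.Println(pointsSliceSize(86_400, 0, 50_000_000, 10_000)) // 5000

	// Far from the limit with few steps, the heuristic is a no-op and
	// returns the step count unchanged.
	fmt.Println(pointsSliceSize(100, 0, 50_000_000, 10_000)) // 100
}
```

This matches the description's claim: with the 50M default limit, allocations shrink from 86.4K to ~5K points per series, and queries far below the limit see the plain step count.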