Description
What did you do?
Upgraded Prometheus from v2.32.1 to v2.33.1
What did you expect to see?
Similar performance and smooth graphs.
What did you see instead? Under which circumstances?
It appears that starting with v2.33.0 Prometheus hits some scalability limit for us.
Something takes longer than it used to, or it started to block where it previously didn't.
Ever since we upgraded Prometheus from v2.32.1 to v2.33.1 (same issue with v2.33.4), on our biggest instances, every 30 minutes we see:
- some counter updates are delayed, which looks like scrapes are getting delayed. I only mention counters because the effect is more visible on counters than on gauges; so either the actual HTTP scrape is delayed, or sample insertion into TSDB is delayed (or wherever the sample timestamp is set on scrape)
- a massive spike in rule evaluation duration
- rule evaluation timeouts:
query timed out in expression evaluation
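For completeness, these are the self-instrumentation queries I've been graphing to quantify the rule-evaluation spikes (metric names taken from Prometheus's own /metrics endpoint; exact names may vary slightly between versions):

```promql
# Wall time of the most recent evaluation, per rule group
prometheus_rule_group_last_duration_seconds

# Evaluations that overran their interval and were skipped
rate(prometheus_rule_group_iterations_missed_total[5m])
```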
Still trying to debug this; so far it doesn't seem to be related to:
- queries - we don't see any spike in query volume
- goroutines - the count stays flat, so it's not as if Go accumulates so many goroutines that some get left behind
- CPU or memory - no elevated resource usage when this happens
- the chunk write queue - this is new code added in 2.33 with a default queue size of 1000, and our metrics show the rate of elements added to the queue spiking to around 250k/s. I tested Prometheus with a bigger queue size (up to 50M) with no effect on this issue
- the query concurrency limit - since that's 20 by default and we usually issue more queries per second than that, I suspected we might be queuing queries too much, but bumping it to 96 (on a server with 128 cores) didn't seem to change anything
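To rule out query queueing I compared in-flight queries against the concurrency limit using Prometheus's engine metrics (again, metric names as exposed by the server's self-instrumentation, so treat them as assumptions for your version):

```promql
# Queries currently being serviced by the engine
prometheus_engine_queries

# The configured concurrency ceiling, for comparison
prometheus_engine_queries_concurrent_max
```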
Since this is happening every 30 minutes and only seems to affect our biggest instances (~15M time series), I've checked what else happens every 30 minutes. We do run with --storage.tsdb.min-block-duration=30m and --storage.tsdb.max-block-duration=30m, mostly to reduce memory usage: we have a fair amount of metrics churn, so more frequent HEAD compaction keeps memory usage lower than it would otherwise be.
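To confirm the 30-minute periodicity lines up with compaction, something like the following can be graphed (a sketch assuming the standard TSDB self-instrumentation metrics; the histogram buckets may differ by version):

```promql
# Compactions per 30m window - should step up exactly when the issue hits
increase(prometheus_tsdb_compactions_total[30m])

# How long compactions are taking at the tail
histogram_quantile(0.99, rate(prometheus_tsdb_compaction_duration_seconds_bucket[30m]))
```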
What I've also noticed is that TSDB HEAD active appenders spike around the time of this issue, likely because they spike during HEAD/block compaction, so I'm not sure whether that's the effect or the cause. Looking at historical metrics I see it was always spiking around that time, but with 2.33 the spikes are bigger. See metrics below:
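The appender spike itself is visible on a single gauge (assuming the metric name hasn't changed across these versions); a max_over_time makes the periodic spikes easier to see on a wide time range:

```promql
max_over_time(prometheus_tsdb_head_active_appenders[5m])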
19dm12 - v2.32.1
19dm13 - v2.33.4
I haven't found any useful logs that would point me in another direction so far, and I'm not sure what other metrics might be relevant here. Any tips on further debugging would be very helpful.
Environment
- System information:
Linux 5.15.19 x86_64
- Prometheus version:
insert output of `prometheus --version` here
- Alertmanager version:
insert output of `alertmanager --version` here (if relevant to the issue)
Prometheus configuration file:
insert configuration here
- Alertmanager configuration file:
insert configuration here (if relevant to the issue)
- Logs:
insert Prometheus and Alertmanager logs relevant to the issue here
