Description
Proposal
HTTP status code 429, meaning "Too Many Requests", should be handled as a retryable condition by Prometheus remote write.
One scenario that has been observed with Cortex: if the backend goes down for ten minutes, every sending Prometheus instance builds up ten minutes of data in its WAL. Once service is restored, they all send as fast as possible, even increasing the number of shards to go faster still. Since the backend does not have the resources to accept all sends in parallel, it has to choose between rejecting sends with a 5xx error so that they are retried, or dropping the data. (Currently Cortex is coded to drop.)
Currently, Prometheus drops the send on any 4xx error and reports an error in the log, while it retries on any 5xx error. 429 could be changed to retry, with a limit on how much data builds up in case the overload is not temporary.
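To make the proposed change concrete, here is a minimal sketch of the classification described above. It is not the actual Prometheus remote-write code: the names `sendOutcome`, `classifyStatus` and the `retryOn429` switch are hypothetical, and the limit on data build-up is left out.

```go
package main

import "net/http"

// sendOutcome is a hypothetical type describing what the sender
// should do with a batch after receiving an HTTP response.
type sendOutcome int

const (
	outcomeOK    sendOutcome = iota // batch accepted
	outcomeRetry                    // keep the batch and send it again later
	outcomeDrop                     // log an error and discard the batch
)

// classifyStatus maps a status code to an outcome.
// With retryOn429 == false it mirrors the behaviour described in this
// issue: 5xx is retried, 4xx is dropped. With retryOn429 == true,
// 429 is moved into the retryable group (the proposal).
func classifyStatus(code int, retryOn429 bool) sendOutcome {
	switch {
	case code/100 == 2:
		return outcomeOK
	case code/100 == 5:
		return outcomeRetry
	case code == http.StatusTooManyRequests && retryOn429:
		return outcomeRetry
	default:
		// Assumption of this sketch: anything else is treated like a 4xx.
		return outcomeDrop
	}
}
```

For example, `classifyStatus(429, false)` yields `outcomeDrop` (today's behaviour), while `classifyStatus(429, true)` yields `outcomeRetry`.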
429 also allows a "Retry-After" header, so the backend can indicate how much it would like Prometheus to slow down.