Remote write should retry on 429

## Proposal

HTTP status code 429, meaning "Too Many Requests", should be handled as a retryable condition by Prometheus remote write.

One scenario that has been observed using Cortex: if the backend goes down for ten minutes, every sending Prometheus builds up ten minutes of data in its WAL. Once service is restored they all send as fast as possible, even increasing the number of shards to go faster. Since the backend does not have the resources to accept all sends in parallel, it has to choose between rejecting sends with a 5xx error so that they are retried, or dropping the data. (Currently Cortex is coded to drop.)

Currently, all 4xx errors cause the send to be dropped by Prometheus with an error report in the log, while all 5xx errors are retried. 429 could be changed to retry, with a limit on how much data is allowed to build up in case the overload is not temporary.
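As a rough illustration of the proposed rule, here is a minimal sketch in Go. The function name `isRetryable` is hypothetical and is not taken from the Prometheus code base; it only encodes the classification described above (5xx retried, 429 retried, other 4xx dropped) and leaves out the limit on data build-up.

```go
package main

import "net/http"

// isRetryable is a hypothetical helper, not the actual Prometheus function.
// It reports whether a remote write response with the given status code
// should be retried under the rule proposed in this issue.
func isRetryable(code int) bool {
	// Current behaviour: every 5xx response is retried.
	if code >= 500 && code < 600 {
		return true
	}
	// Proposed change: 429 (Too Many Requests) is retried as well.
	if code == http.StatusTooManyRequests {
		return true
	}
	// Every other 4xx response is still dropped and logged.
	return false
}
```

With this rule, `isRetryable(429)` and `isRetryable(503)` return `true`, while `isRetryable(400)` returns `false`.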

A 429 response can also carry a `Retry-After` header, so the backend can indicate how much slower it would like Prometheus to go.

Remote write should retry on 429 #8418
