Remote write should retry on 429 #8418

@bboreham

Proposal

HTTP status code 429 ("Too Many Requests") should be handled as a retryable condition by Prometheus remote write.

One scenario that has been observed using Cortex: if the backend goes down for ten minutes, every sending Prometheus will build up ten minutes of data in its WAL. Once service is restored, they will all send as fast as possible, even increasing the number of shards to go faster still. Since the backend does not have the resources to accept all sends in parallel, it has to choose between rejecting sends with a 5xx error so that they are retried, or dropping the data. (Currently Cortex is coded to drop.)

Currently, all 4xx errors cause Prometheus to drop the send and report an error in the log, while all 5xx errors are retried. 429 could be changed to retry, with a limit on the build-up of data in case the overload is not temporary.
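To make the proposed rule concrete, here is a minimal Go sketch of the classification described above. It is not taken from the Prometheus code base: the `outcome` type, the `classify` function, and the `max429Retries` cap are all hypothetical names, and a simple retry count is only one assumed way to express the "limit on build-up of data" (a time or size bound would work as well).

```go
package main

import (
	"fmt"
	"net/http"
)

// outcome is a hypothetical type naming what the sender does with a batch.
type outcome string

const (
	delivered outcome = "delivered"
	retry     outcome = "retry"
	drop      outcome = "drop"
)

// classify maps a remote-write response status to an outcome.
// retries429 counts how often this batch was already retried after a 429;
// max429Retries is an assumed cap standing in for the "limit on build-up".
func classify(status, retries429, max429Retries int) outcome {
	switch {
	case status/100 == 2:
		return delivered
	case status/100 == 5:
		return retry // existing behaviour described in the issue
	case status == http.StatusTooManyRequests:
		if retries429 < max429Retries {
			return retry // proposed behaviour
		}
		return drop // overload is not temporary: stop accumulating data
	default:
		return drop // other 4xx (and anything unexpected) is not retried
	}
}

func main() {
	for _, status := range []int{200, 400, 429, 503} {
		fmt.Println(status, classify(status, 0, 3))
	}
}
```

Keeping the 429 branch separate from the 5xx branch means the cap applies only to the new behaviour, so existing 5xx handling is unchanged in this sketch.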

A 429 response can also carry a Retry-After header, so the backend can indicate how much slower it would like Prometheus to go.
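A sender honouring that header would need to turn its value into a wait time. The sketch below shows one way to do so in Go, under the assumption that the value is either a number of seconds or an HTTP date, the two forms the HTTP specification allows. `retryDelay` and its `fallback` parameter are hypothetical names for illustration, not part of Prometheus.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// retryDelay converts a Retry-After header value into a wait duration.
// fallback is returned when the value is missing or cannot be parsed.
func retryDelay(value string, fallback time.Duration) time.Duration {
	// Form 1: a non-negative number of seconds, e.g. "30".
	if secs, err := strconv.Atoi(value); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second
	}
	// Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
	if at, err := http.ParseTime(value); err == nil {
		if d := time.Until(at); d > 0 {
			return d
		}
		return 0 // the date is already in the past: no extra wait
	}
	return fallback
}

func main() {
	h := http.Header{}
	h.Set("Retry-After", "30")
	fmt.Println(retryDelay(h.Get("Retry-After"), 5*time.Second))
}
```

In practice the returned delay would probably be clamped to some maximum so that a misbehaving backend cannot stall the sender indefinitely; that bound is left out here to keep the example short.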
