Consider status code 429 as recoverable errors to avoid resharding#8237
csmarchbanks merged 2 commits into prometheus:master
Conversation
This is a breaking change. I think this is a good idea, because it is very annoying in practice, but it should be configurable.
@roidelapluie with configurable, do you mean giving users the ability to enable or disable backoff for rate-limiting?
Retryability, yes. (But I am not a remote-write maintainer.)
First, thanks for working on this. I think behavior like this is certainly useful, as dropping 429s indiscriminately is not ideal. That said, I am not sure we ever decided what the best behavior on a 429 is. If you encounter a 429 in regular operation, chances are your remote write will just continue to fall behind until you force a restart, dropping a significant chunk of data. Right now the 429 behavior can be useful, because dropping some samples basically means less resolution in your data, which might be better for many people than not having any recent data at all. Configuration of what to do on a 429 would be ok, but are there other ideas for handling rate limits more gracefully? Sending a new request that will probably fail 30ms later doesn't seem great. A limited number of retries plus respect for the Retry-After header could be one option.
I didn't get the Retry-After thing. Is it something that the remote storage sends back to the remote-write component (if yes, would it be a response header?), i.e. a time after which it should retry?
To be clear, I am not saying implement the Retry-After logic. I am just not sure the 429 behavior should be exactly the same as the 5xx retry behavior, and would like a discussion. I am curious what @cstyan thinks too.
Ah, I feel like rate limiting is a thing from the remote storage. That means the remote storage should get more control over how it wants a particular request to be handled. This gives it the ability to handle the situation and come out of it. So a response header should be a good way out here. One such header could be Max-Retries.
If there's a standard for how to respond to 429s via the Retry-After header, I think we should support it. I'm hesitant to include an option for something like a Max-Retries header.
Ah, any suggestions on how we should move ahead here?
Supporting Retry-After sounds good to me.
Force-pushed from 3b61ad8 to 33922ac
@csmarchbanks @cstyan I have added support for the Retry-After header.
csmarchbanks left a comment:
I am in favor of supporting the Retry-After header, but I don't think we should add Max-Retries as a response header or as part of this PR. Looking at #8237 (comment), Callum is also hesitant, so let's focus on just the 429 behavior and Retry-After in this PR.
storage/remote/queue_manager.go (outdated)

      // If we make it this far, we've encountered a recoverable error and will retry.
      onRetry()
    - level.Debug(l).Log("msg", "failed to send batch, retrying", "err", err)
    + level.Debug(l).Log("msg", "failed to send batch, retrying", "err", backoffErr.Error())
Somewhat a reminder for me in case a rebase doesn't go well: this should be Warn level, and "failed" should be "Failed".
I have implemented this in the recent push.
Force-pushed from 563bc8d to 1778657
csmarchbanks left a comment:
Some good progress, and a couple of ideas, mostly to simplify the code!
storage/remote/queue_manager.go (outdated)

        }
        onRetry()
    } else {
        level.Info(l).Log("msg", "retry-after cannot be in past, retrying the default backoff way")
I think this could be a debug log. I would expect it to happen sometimes, and it is reasonable behavior. You could also get rid of it by always sleeping the larger of backoff or retryAfter.
> sleeping the larger amount of backoff or retryAfter

Sorry, but how can I get the larger of the two when retryAfter is itself in the past? Am I missing something?
Force-pushed from 1778657 to 1ad4361, then to 8bf7442
config/config.go (outdated)

      // Backoff times for retrying a batch of samples on recoverable errors.
      MinBackoff: model.Duration(30 * time.Millisecond),
    - MaxBackoff: model.Duration(100 * time.Millisecond),
    + MaxBackoff: model.Duration(1000 * time.Millisecond),
Hmm, the default max-backoff change is not really related to handling 429s better. Could that be part of a different PR if we want to do it?
Sure, I have reverted this change.
storage/remote/client.go
Outdated
| parsedDuration, err := time.Parse(time.RFC1123, t) | ||
| if err == nil { | ||
| if parsedDuration.Before(time.Now()) { | ||
| return invalidRetryAfter |
If we check for < 0 later on, we do not need this.
We can do it this way as well, or have a unit test for this function (which seems more reliable). For now, I am going with the latter. I am happy to revert if we require it.
Yeah, the more I think about it, the more I would prefer the check for < 0 rather than depending on -1. A whole lot more invalid cases are covered by that check in the case of future changes.
Force-pushed from 8bf7442 to 42b07b4
@roidelapluie got it!
storage/remote/client.go

    parsedDuration, err := time.Parse(http.TimeFormat, t)
    if err == nil {
        s := time.Until(parsedDuration).Seconds()
        return model.Duration(s) * model.Duration(time.Second)

Suggested change:

    - return model.Duration(s) * model.Duration(time.Second)
    + return model.Duration(s * time.Second)
Sorry, this suggestion is a compilation error. We cannot multiply s (a float64) by time.Second (a time.Duration), so model.Duration(s * time.Second) does not type-check.
LGTM after the last nits. I think the logs would show that there is a 429, so that should be good for users to understand what is going on.
Force-pushed from 6d12009 to 0e474bb
@csmarchbanks, a note to check the logs, as they have been changed to log a 429 when one is received.
Signed-off-by: Harkishen-Singh <harkishensingh@hotmail.com>
Force-pushed from 0e474bb to fbc0e24
csmarchbanks left a comment:
Thanks! 👍 from me. @roidelapluie would you take another look to make sure your comments have been addressed?
LGTM. We could later add a flag to disable these retries if we see issues in the wild.
This is indeed a breaking change. From what I can see, we never drop data here? This means Prometheus will just keep falling behind if it is being constantly rate-limited. This assumes that Prometheus doesn't send data to a remote system that basically limits it constantly. That is not true: we limit people in GrafanaCloud pretty regularly and they're completely fine with it. Further, in Cortex we return 429 not just on a samples-per-second limit but in other cases too. For example, we have a limit for active series, and if a user sends more than their active-series limit, we return a 429. In this case, Prometheus would never be able to proceed, because it would constantly try to send the sample that creates a new series, and it would fail forever. Now I am not sure if Cortex has the right behavior with 429 for other limits; it seemed pretty right when we built it ;) I still think it has the right behavior, but given Cortex has a major chunk of remote-write use cases, we shouldn't roll out this change in Prometheus until we fix it in Cortex. WDYT @roidelapluie @csmarchbanks?
My choice would be not to break users and treat 429 as non-recoverable by default, with the option to opt in to these retries. That default could be changed in Prometheus 3.x; I do not think we should depend only on changes in remote-write backends to make that switch.
@gouthamve I am sorry, but isn't this very specific to Cortex? That looks like a custom usage of 429 (I can be wrong). But, unless the ...
@Harkishen-Singh I don't think so. Before this PR, a 429 was considered a non-recoverable error. After this PR, a 429 is always retried until it succeeds, either honoring the Retry-After delay or falling back to the default backoff.
This is not exact: before, we would not retry at all. Now, with a 429 and no Retry-After, we retry with the default backoff.
Note: #8474 is open.
Sorry, I think this change is only safe if you don't retry without the header, so it would default to the same behavior as previously for 429, and it should probably be behind a flag. Otherwise this will retry even without the header from the remote-storage backend, which could be very problematic when 429s are returned for rate limiting, or even for reasons that are never going to succeed on retry.
Revert "Consider status code 429 as recoverable errors to avoid resharding (prometheus#8237)"
This reverts commit cd41247.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>

Revert "Revert "Consider status code 429 as recoverable errors to avoid resharding (prometheus#8237)""
This reverts commit 49a8ce5. This commit is necessary since we only wanted to not have the functionality in 2.25. It will be improved soon on the main branch.
Co-authored-by: Harkishen-Singh <harkishensingh@hotmail.com>
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
Signed-off-by: Harkishen-Singh harkishensingh@hotmail.com
Fixes: #8418
We consider a 400 status code the same as a 200 status code for the purpose of allowing resharding. However, if the remote end is already rate-limiting (i.e., 429), this makes matters worse. Hence, treat 429 as a recoverable error and fall into the backoff loop. Also, a 100-millisecond max backoff is short for a 429, so maybe 1 second would be a good balance.
cc @csmarchbanks @cstyan