Consider status code 429 as recoverable errors to avoid resharding #8237

Merged
csmarchbanks merged 2 commits into prometheus:master from harkishen:reshard-backoff-on-429
Feb 10, 2021

Conversation

@harkishen (Contributor) commented Nov 28, 2020

Signed-off-by: Harkishen-Singh harkishensingh@hotmail.com

Fixes: #8418

We currently treat a 400 status code the same as a 200 for the purpose of allowing resharding. However, if the server is already rate-limiting us (i.e., 429), resharding makes matters worse. Hence, treat 429 as a recoverable error and fall into the backoff loop. Also, 100 milliseconds is short for a 429, so 1 second may be a better balance.

cc @csmarchbanks @cstyan
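The classification this PR proposes — treat 429 like the retryable 5xx family instead of like the other 4xx codes — can be sketched as follows. This is a hypothetical helper for illustration, not the merged code; Prometheus wraps this decision in its own error types.

```go
package main

import "fmt"

// retryable reports whether a remote-write send should be retried with
// backoff rather than dropped. Sketch of the behavior proposed in this PR.
func retryable(statusCode int) bool {
	// 5xx responses are transient server failures; 429 signals rate
	// limiting, which this change also treats as recoverable instead of
	// dropping the batch (and triggering resharding).
	return statusCode/100 == 5 || statusCode == 429
}

func main() {
	for _, code := range []int{200, 400, 429, 500} {
		fmt.Printf("%d retryable=%v\n", code, retryable(code))
	}
}
```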

@roidelapluie (Member) commented Nov 29, 2020

This is a breaking change.

I think this is a good idea, because it is very annoying in practice, but should be configurable.

@harkishen (Contributor Author)

@roidelapluie by configurable, do you mean giving users the ability to enable or disable backoff for rate limiting?

@roidelapluie (Member)

retryability, yes. (but I am not a remote write maintainer)

@csmarchbanks (Member)

First, thanks for working on this, I think something like this behavior is certainly useful as dropping 429s indiscriminately is not ideal.

That said, I am not sure we ever decided what the best behavior on a 429 is. If you encounter a 429 in regular operation, chances are your remote write will just continue to fall behind until you force a restart, dropping a significant chunk of data. Right now the 429 behavior can be useful because dropping some samples means less resolution in your data, which for many people might be better than having no recent data at all.

Configuration of what to do on a 429 would be ok, but are there other ideas for handling rate limits more gracefully? Sending a new request that will probably fail 30ms later doesn't seem great. Limited number of retries + respect for the Retry-After header could be one option? This might end up relating to #7912.

@harkishen (Contributor Author) commented Nov 30, 2020

I didn't get the Retry-After thing. Is it something that the remote storage would send back to the remote-write component (if yes, would it come as a response header?), i.e. a time only after which it should retry?

@csmarchbanks (Member)

The Retry-After response header may be returned by remote storage to tell the client (remote write) how long to wait before trying another send. It is called out in https://tools.ietf.org/html/rfc6585#section-4.

To be clear, I am not saying implement the Retry-After logic. I am just not sure the 429 behavior should be the exact same as 5XX retry behavior, and would like a discussion. I am curious what @cstyan thinks too.

@harkishen (Contributor Author) commented Nov 30, 2020

Ah, I feel like rate limiting is driven by the remote storage. That means the remote storage should get more control over how it wants a particular request to be handled, giving it the ability to manage the situation and recover from it. So response headers should be a good way out here. One header could be Retry-After, and another (as a TTL) Max-Retries?

@cstyan (Member) commented Dec 4, 2020

If there's a standard for how to respond to 429s via the Retry-After header I think we should support it. I'm hesitant to include an option for Max-Retries given the work we've put into remote write to attempt to never lose data, but not strongly opposed.

@harkishen (Contributor Author)

Ah, any suggestions on how we should move ahead here?

@cstyan (Member) commented Dec 8, 2020

Supporting Retry-After when we see a 429 seems like the way to go to me, @csmarchbanks ?

@harkishen harkishen force-pushed the reshard-backoff-on-429 branch from 3b61ad8 to 33922ac on January 4, 2021 04:38
@harkishen (Contributor Author)

@csmarchbanks @cstyan I have added support for Retry-After and Max-Retries (since the discussion seemed favourable towards some max-retries limit) as inputs from the response headers of remote storage systems. Can you please give feedback on the implementation?

@csmarchbanks (Member) left a comment

I am in favor of supporting the Retry-After header, but I don't think we should add Max-Retries as a header response or as part of this PR. Looking at #8237 (comment), Callum is also hesitant, so let's focus on just 429 behavior and Retry-After in this PR.

// If we make it this far, we've encountered a recoverable error and will retry.
onRetry()
- level.Debug(l).Log("msg", "failed to send batch, retrying", "err", err)
+ level.Debug(l).Log("msg", "failed to send batch, retrying", "err", backoffErr.Error())
Member
Somewhat of a reminder for me in case a rebase doesn't go well: this should be Warn level, and failed should be Failed.

Contributor Author
I have implemented this in the recent push.

@harkishen harkishen force-pushed the reshard-backoff-on-429 branch 2 times, most recently from 563bc8d to 1778657 on January 29, 2021 09:57
@csmarchbanks (Member) left a comment

Some good progress; a couple of ideas, mostly to simplify the code!

}
onRetry()
} else {
level.Info(l).Log("msg", "retry-after cannot be in past, retrying the default backoff way")
Member
I think this could be a debug log. I would expect it to happen sometimes and it is a reasonable behavior. You could also get rid of it by always sleeping the larger amount of backoff or retryAfter.

Contributor Author

sleeping the larger amount of backoff or retryAfter

Sorry, but how can I take the larger of the two when retryAfter is itself in the past? Am I missing something?

@harkishen harkishen force-pushed the reshard-backoff-on-429 branch from 1778657 to 1ad4361 on January 29, 2021 19:56
@harkishen harkishen force-pushed the reshard-backoff-on-429 branch from 1ad4361 to 8bf7442 on January 29, 2021 20:49
config/config.go Outdated
// Backoff times for retrying a batch of samples on recoverable errors.
MinBackoff: model.Duration(30 * time.Millisecond),
- MaxBackoff: model.Duration(100 * time.Millisecond),
+ MaxBackoff: model.Duration(1000 * time.Millisecond),
Member

Hmm, the default max backoff change is not really related to handling 429s better. Could that be part of a different PR if we want to do it?

Contributor Author

Sure, I have reverted this change.

parsedDuration, err := time.Parse(time.RFC1123, t)
if err == nil {
if parsedDuration.Before(time.Now()) {
return invalidRetryAfter
Member

If we check for < 0 later on we do not need this.

@harkishen (Contributor Author) Feb 1, 2021

We can do it this way as well, or have a unit test for this function (which seems more reliable). For now, I am going with the latter. I am happy to revert if required.

Member

Yeah, the more I think about it the more I would prefer the check for < 0 rather than depending on -1. A whole lot more invalid cases are covered with that check in the case of future changes.

@harkishen harkishen force-pushed the reshard-backoff-on-429 branch from 8bf7442 to 42b07b4 on February 1, 2021 10:07
@harkishen (Contributor Author)

@roidelapluie got it!

parsedDuration, err := time.Parse(http.TimeFormat, t)
if err == nil {
s := time.Until(parsedDuration).Seconds()
return model.Duration(s) * model.Duration(time.Second)
Member

Suggested change:
- return model.Duration(s) * model.Duration(time.Second)
+ return model.Duration(s * time.Second)

Contributor Author

Sorry, this is a compilation error: we cannot multiply s (a float64) by time.Second (a time.Duration) and convert the result to a model.Duration.

@roidelapluie (Member)

LGTM after the last nits. I think the logs will show that there was a 429, so that should help users understand what is going on.

@harkishen harkishen force-pushed the reshard-backoff-on-429 branch from 6d12009 to 0e474bb on February 9, 2021 16:30
@harkishen (Contributor Author)

@csmarchbanks, a note to check the logs, as they have been changed to log the 429 when one is received.

Signed-off-by: Harkishen-Singh <harkishensingh@hotmail.com>
Signed-off-by: Harkishen-Singh <harkishensingh@hotmail.com>
@harkishen harkishen force-pushed the reshard-backoff-on-429 branch from 0e474bb to fbc0e24 on February 10, 2021 17:31
@csmarchbanks (Member) left a comment

Thanks! 👍 from me. @roidelapluie would you take another look to make sure your comments have been addressed?

@roidelapluie (Member)

LGTM, we could later add a flag to disable these retries if we see issues in the wild.

@csmarchbanks csmarchbanks merged commit cd41247 into prometheus:master Feb 10, 2021
@gouthamve (Member) commented Feb 11, 2021

This is indeed a breaking change. From what I can see, we never drop data here? This means Prometheus will just keep falling behind if it is being constantly rate limited? This assumes that Prometheus doesn't send data to a remote system that limits it constantly.

This is not true, we limit people in GrafanaCloud pretty regularly and they're completely fine with it. Further, in Cortex, we return 429 not just on a samples per second limit and do it for other cases too. For example, we have a limit for active series and if a user sends more than their active series limit, we return a 429. In this case, Prometheus would never be able to proceed because it would constantly try to send the sample that creates a new series and it would fail forever.

Now I am not sure if Cortex has the right behavior with 429 for other limits, it seemed pretty right when we built it ;) I still think it has the right behavior, but given Cortex has a major chunk of remote-write use-cases, we shouldn't roll out this change in Prometheus until we fix it in Cortex. WDYT @roidelapluie @csmarchbanks?

@roidelapluie (Member)

My choice would be not to break users and to treat 429 as non-recoverable by default, with the option to opt in to these retries. That default could be changed in Prometheus 3.x; I do not think we should depend only on changes in remote-write backends to make that switch.

@harkishen (Contributor Author) commented Feb 11, 2021

429 not just on a samples per second limit and do it for other cases too

@gouthamve I am sorry but isn't this very specific to cortex?

That looks like a custom usage of 429 (I can be wrong). But, unless the Retry-After is specified, it would continue as earlier.

@pracucci (Contributor)

But, unless the Retry-After is specified, it would continue as earlier.

@Harkishen-Singh I don't think so. Before this PR, a 429 was considered a non-recoverable error. After this PR, a 429 is always retried until it succeeds, whether or not the response contains Retry-After.

@roidelapluie (Member) commented Feb 11, 2021

429 not just on a samples per second limit and do it for other cases too

@gouthamve I am sorry but isn't this very specific to cortex?

That looks like a custom usage of 429 (I can be wrong). But, unless the Retry-After is specified, it would continue as earlier.

This is not exact: before, we would not retry at all. Now, with a 429 and no Retry-After, we retry with the default backoff.

@roidelapluie (Member)

Note: #8474 is open.

@ranton256

Sorry, I think this change is only safe if you don't retry without the header, so it would default to the previous behavior for 429, and it should probably be behind a flag. Otherwise this will retry even without the header from the remote storage backend, which could be very problematic when 429s are returned for rate limiting, or even for reasons that will never succeed on retry.

@csmarchbanks (Member) commented Feb 12, 2021

We certainly understand the concern, and should have put in a flag as part of this PR, sorry about that. We are quickly adding a flag for this release, and will likely add rules such as what you suggest, or a limited number of retries, in future releases. See #8474 and #8477.

roidelapluie pushed a commit to roidelapluie/prometheus that referenced this pull request Feb 12, 2021
…rding (prometheus#8237)"

This reverts commit cd41247.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
roidelapluie pushed a commit to roidelapluie/prometheus that referenced this pull request Feb 12, 2021
…rometheus#8237)

This reverts commit 49a8ce5.

This commit is necessary since we only wanted to not have the
functionality in 2.25. It will be improved soon on the main branch.

Co-authored-by: Harkishen-Singh <harkishensingh@hotmail.com>

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>

Successfully merging this pull request may close these issues.

Remote write should retry on 429

7 participants