Consider status code 429 as recoverable errors to avoid resharding#8237
csmarchbanks merged 2 commits into prometheus:master
Conversation
This is a breaking change. I think this is a good idea, because it is very annoying in practice, but it should be configurable.
@roidelapluie with configurable, do you mean giving users the ability to enable or disable backoff for rate-limiting?
Retryability, yes. (But I am not a remote-write maintainer.)
First, thanks for working on this. I think behavior like this is certainly useful, as dropping 429s indiscriminately is not ideal. That said, I am not sure we ever decided what the best behavior on a 429 is. If you encounter a 429 in regular operation, chances are your remote write will just continue to fall behind until you force a restart, dropping a significant chunk of data. Right now the 429 behavior can be useful, because dropping some samples basically means less resolution in your data, which might be better for many people than not having any recent data at all. Configuration of what to do on a 429 would be ok, but are there other ideas for handling rate limits more gracefully? Sending a new request that will probably fail 30ms later doesn't seem great. A limited number of retries plus respect for the Retry-After header could be one option.
I didn't get the Retry-After thing. Is it something that the remote storage sends back to the remote-write component (if yes, would it be a response header?), i.e. a time after which it should retry?
To be clear, I am not saying implement the Retry-After logic. I am just not sure the 429 behavior should be exactly the same as the 5xx retry behavior, and would like a discussion. I am curious what @cstyan thinks too.
Ah, I feel like rate limiting is a thing from the remote storage. That means the remote storage should get more control over how it wants a particular request to be handled. This gives it the ability to handle the situation and come out of it. So a response header should be a good way out here. One such header could be Max-Retries.
If there's a standard for how to respond to 429s via the Retry-After header, I think we should support it. I'm hesitant to include an option for something like a Max-Retries header.
Ah, any suggestions on how we should move ahead here?
Supporting Retry-After sounds good to me.
Force-pushed from 3b61ad8 to 33922ac
@csmarchbanks @cstyan I have added support for the Retry-After header.
csmarchbanks left a comment:
I am in favor of supporting the Retry-After header, but I don't think we should add Max-Retries as a response header or as part of this PR. Looking at #8237 (comment), Callum is also hesitant, so let's focus on just the 429 behavior and Retry-After in this PR.
storage/remote/queue_manager.go (outdated)

      // If we make it this far, we've encountered a recoverable error and will retry.
      onRetry()
    - level.Debug(l).Log("msg", "failed to send batch, retrying", "err", err)
    + level.Debug(l).Log("msg", "failed to send batch, retrying", "err", backoffErr.Error())
Somewhat a reminder for me in case a rebase doesn't go well: this should be Warn level, and "failed" should be "Failed".
I have implemented this in the recent push.
Force-pushed from 563bc8d to 1778657
csmarchbanks left a comment:
Some good progress, and a couple of ideas, mostly to simplify the code!
storage/remote/queue_manager.go (outdated)

        }
        onRetry()
    } else {
        level.Info(l).Log("msg", "retry-after cannot be in past, retrying the default backoff way")
I think this could be a debug log. I would expect it to happen sometimes, and it is reasonable behavior. You could also get rid of it by always sleeping the larger of backoff or retryAfter.
> sleeping the larger amount of backoff or retryAfter

Sorry, but how can I get the larger of the two when retryAfter is itself in the past? Am I missing something?
Force-pushed from 1778657 to 1ad4361, then to 8bf7442
config/config.go (outdated)

      // Backoff times for retrying a batch of samples on recoverable errors.
      MinBackoff: model.Duration(30 * time.Millisecond),
    - MaxBackoff: model.Duration(100 * time.Millisecond),
    + MaxBackoff: model.Duration(1000 * time.Millisecond),
Hmm, the default max-backoff change is not really related to handling 429s better. Could that be part of a different PR if we want to do it?
Sure, I have reverted this change.
storage/remote/client.go
Outdated
| parsedDuration, err := time.Parse(time.RFC1123, t) | ||
| if err == nil { | ||
| if parsedDuration.Before(time.Now()) { | ||
| return invalidRetryAfter |
If we check for < 0 later on, we do not need this.
We can do it this way as well, or have a unit test for this function (which seems more reliable). For now, I am going with the latter. I am happy to revert if we require it.
Yeah, the more I think about it, the more I would prefer the check for < 0 rather than depending on -1. A whole lot more invalid cases are covered by that check in the case of future changes.
Force-pushed from 8bf7442 to 42b07b4
@roidelapluie got it!
storage/remote/client.go

    parsedDuration, err := time.Parse(http.TimeFormat, t)
    if err == nil {
        s := time.Until(parsedDuration).Seconds()
        return model.Duration(s) * model.Duration(time.Second)

Suggested change:

    - return model.Duration(s) * model.Duration(time.Second)
    + return model.Duration(s * time.Second)
Sorry, this suggestion is a compilation error. We cannot multiply s (a float64) by time.Second (a time.Duration), so model.Duration(s * time.Second) does not type-check.
LGTM after the last nits. I think the logs would show that there is a 429, so that should be good for users to understand what is going on.
Force-pushed from 6d12009 to 0e474bb
@csmarchbanks, a note to check the logs, as they have been changed to log a 429 when one is received.
Signed-off-by: Harkishen-Singh <harkishensingh@hotmail.com>
Force-pushed from 0e474bb to fbc0e24
csmarchbanks left a comment:
Thanks! 👍 from me. @roidelapluie would you take another look to make sure your comments have been addressed?
LGTM. We could later add a flag to disable these retries if we see issues in the wild.
This is indeed a breaking change. From what I can see, we never drop data here? This means Prometheus will just keep falling behind if it is being constantly rate-limited. This assumes that Prometheus doesn't send data to a remote system that basically limits it constantly. That is not true: we limit people in GrafanaCloud pretty regularly and they're completely fine with it. Further, in Cortex we return 429 not just on a samples-per-second limit but in other cases too. For example, we have a limit for active series, and if a user sends more than their active-series limit, we return a 429. In this case, Prometheus would never be able to proceed, because it would constantly try to send the sample that creates a new series, and it would fail forever. Now I am not sure if Cortex has the right behavior with 429 for other limits; it seemed pretty right when we built it ;) I still think it has the right behavior, but given Cortex has a major chunk of remote-write use cases, we shouldn't roll out this change in Prometheus until we fix it in Cortex. WDYT @roidelapluie @csmarchbanks?
My choice would be not to break users and treat 429 as non-recoverable by default, with the option to opt in to these retries. That default could be changed in Prometheus 3.x; I do not think we should depend only on changes in remote-write backends to make that switch.
@gouthamve I am sorry, but isn't this very specific to Cortex? That looks like a custom usage of 429 (I can be wrong). But, unless the ...
@Harkishen-Singh I don't think so. Before this PR, a 429 was considered a non-recoverable error. After this PR, a 429 is always retried until it succeeds, either honoring the Retry-After delay or falling back to the default backoff.
This is not exact: before, we would not retry at all. Now, with a 429 and no Retry-After, we retry with the default backoff.
Note: #8474 is open.
Sorry, I think this change is only safe if you don't retry without the header, so it would default to the same behavior as previously for 429, and it should probably be behind a flag. Otherwise this will retry even without the header from the remote-storage backend, which could be very problematic when 429s are returned for rate limiting, or even for reasons that are never going to succeed on retry.
Revert "Consider status code 429 as recoverable errors to avoid resharding (prometheus#8237)"
This reverts commit cd41247.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>

Revert "Revert "Consider status code 429 as recoverable errors to avoid resharding (prometheus#8237)""
This reverts commit 49a8ce5. This commit is necessary since we only wanted to not have the functionality in 2.25. It will be improved soon on the main branch.
Co-authored-by: Harkishen-Singh <harkishensingh@hotmail.com>
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
Signed-off-by: Harkishen-Singh harkishensingh@hotmail.com
Fixes: #8418
We consider a 400 status code the same as a 200 status code for the purpose of allowing resharding. However, if the remote end is already rate-limiting (i.e., 429), this makes matters worse. Hence, treat 429 as a recoverable error and fall into the backoff loop. Also, a 100-millisecond max backoff is short for a 429, so maybe 1 second would be a good balance.
cc @csmarchbanks @cstyan