feat(cody): add circuit breaker to handle timed-out requests and rate limit hits by taras-yemets · Pull Request #64133 · sourcegraph/sourcegraph-public-snapshot

taras-yemets · 2024-07-29T15:12:36Z

Closes https://linear.app/sourcegraph/issue/CODY-2758/[autocomplete-latency]-add-circuit-breaker-in-cody-gateway-to-handle

This PR introduces an in-memory model availability tracker to ensure we do not send consequent requests to the currently unavailable LLM providers.

The tracker maintains a history of error records for each model. It evaluates these records to determine whether the model is available for new requests. The evaluation follows these steps:

For every request to an upstream provider, the tracker records any errors that occur. Specifically, it logs timeout errors (when a request exceeds its deadline) and responses with a 429 status code (Too Many Requests).
These error records are stored in a circular buffer for each model. This buffer holds a fixed number of records, ensuring efficient memory usage.
The tracker calculates the failure ratio by analyzing the stored records. It checks the percentage of errors within a specified evaluation window and compares this against the total number of recent requests.
Based on the calculated failure ratio, the tracker decides whether the model is available:
- Model Unavailable: If the ratio of failures (timeouts or 429 status codes) exceeds a predefined threshold (X%), the model is marked as unavailable. In this state, the system does not send new requests to the upstream provider.
When a model is unavailable, the system immediately returns an error status code, typically a 503 Service Unavailable, to the client. This informs the client that the service is temporarily unavailable due to upstream issues.
- Model Available: If the failure ratio is within acceptable limits, the system proceeds with sending the request to the upstream provider.

This PR suggests considering a model unavailable if 95% of the last 100 requests within the past minute either time out or return a 429 status code. I am not sure about these exact values and suggest them as a starting point for discussion

Test plan

Added unit tests
CI

Changelog

chrsmith

This definitely seems like it's a good approach, and on the right track.

My only substantive feedback is to consider revisiting the data structure you are using for keeping track of the errors so that it can take up less space and/or be more efficient. (Since continuously copying/resizing slices isn't great.)

e.g. consider if we instead just made these all fixed-sizes using a Golang fixed-size array instead of a slice and/or using a proper circular buffer.

One thing that does feel a little off though, is that this is inherently going to run into data races. As multiple concurrent requests go to report their results. We need to use a mutex or channel to ensure that we have one writer at a time. (Which could lead to increasing latency due to contention.) Or use some sort of lock-free data structure, and/or more fancy approach.

But we probably should just use a mutex since we only would write the record in the event of the upstream request failing, right? So for the common-case, when we get a 200, there won't be any overhead at all. Right?

chrsmith · 2024-07-29T16:13:18Z

+// errorThresholdForUnavailability is the maximum number of error records
+// allowed within the modelAvailabilityCheckWindow before a model is
+// considered unavailable.
+const errorThresholdForUnavailability = 10


This is an OK starting point. But this is one of those things that doesn't scale very well.

e.g. imagine we get 100x the number of requests, because of generic growth. Then, even if the transient error rate is 0.0001%, eventually you'd hit this static threshold.

So again, this is fine for now. But just keep in the back of your mind that we'll want to make this a bit more robust in the future. Perhaps by checking a "percentage of all requests within that time window", etc.

chrsmith · 2024-07-29T16:17:07Z

+
+func newModelsLoadTracker() *modelsLoadTracker {
+	return &modelsLoadTracker{
+		records: map[string][]*record{},


Minor-nit: This is totally OK. But I'm guessing that we have some idea of how many records we want to store. So it would be a tiny bit more efficient to pre-allocate the space. e.g. make(map[string][]*record{}, 1024).

Note that you only really benefit from that if the map will grow considerably (e.g. contain thousands of elements. If we only expect ~dozens or ~hundreds of elements, then this would fall into the 🤷 not worth it camp.)

chrsmith · 2024-07-29T16:21:31Z

+	}
+
+	records := mlt.records[gatewayModel]
+	records = append([]*record{r}, records...)


⚠️ this is really bad from an efficiency perspective.

You are creating a new slice with a single element. ([]*record{r})

You are then appending records to the end of it. (Which will need to resize the newly created slice, and copy over all the elements.)

Instead, you should just write this as:

records = append(records, r)

This will keep the data in the records slice as-is, and only copy the new element over. If the underlying slice didn't have enough space, only then would it "copy all the elements into a new slice".

Does that all make sense?

Addressed when implementing suggestion from https://github.com/sourcegraph/sourcegraph/pull/64133#discussion_r1695520961 by creating a slice with a predefined length and assigning new values by index instead of appending. Is this approach sustainable?

https://github.com/sourcegraph/sourcegraph/blob/6080a047e634cbf76d755bfd44974795423105a1/cmd/cody-gateway/internal/httpapi/completions/modelsloadtracker.go#L82-L96

by creating a slice with a predefined length and assigning new values by index instead of appending. Is this approach sustainable?

Yup, SGTM. I think it would be arguably, slightly cleaner to use a fixed-width array. (e.g. the type of records being [20]record.) But that's more of a matter of style.

The key thing is that once we introduce the modelCircuitBreaker, that we don't do a bunch of unnecessary copying/resizing of the slice. (Which we avoid by using that headIndex and just overwriting previous failures.)

chrsmith

Lots of nitpicks, but this code generally looks good to me. However, I left a couple of important comments with a ⚠️ since I'm not entirely sure we want to land the logic that you have here. (But if I am mistaken, or this is intentional, go for it.)

Also, this seems like a situation where some unit tests would be appreciated. (Hitting the sweet spot where the code is sophisticated enough that they would be helpful to confirm things work; but the code is also simple enough that writing the tests wouldn't be a major PITA 😅 )

chrsmith · 2024-07-30T17:14:55Z

+// pattern to temporarily mark models as unavailable if they exceed the allowed number
+// of errors within the defined time window.
+type modelsLoadTracker struct {
+	mu sync.RWMutex


Consider moving the mutex down into the modelCircuitBreaker. That way we can record failures for two different models simultaneously. It's just a way of further reducing the possibility of contention on the lock.

With the updated updated implementation (https://github.com/sourcegraph/sourcegraph/pull/64133#issuecomment-2263187863) we have two mutexes: one for modelsLoadTracker (to lock on modelCircuitBreaker creation) and one on modelCircuitBreaker to lock on adding record and calculating the failure ratio. Does this approach sound reasonable to you?

chrsmith · 2024-07-30T17:15:47Z

+
+	// failureThreshold represents the maximum number of error records
+	// allowed within the timeout before a model is considered unavailable.
+	failureThreshold int


Nit: What do you think about maxFailures or even maxAllowedFailures as a name? failureThreshold is definitely OK. But max-* might seem like a better fit? WDYT?

With the updated implementation (https://github.com/sourcegraph/sourcegraph/pull/64133#issuecomment-2263187863) I renamed this to failureRatio, WDYT?

chrsmith · 2024-07-30T17:21:32Z

+// record represents an individual error occurrence with details about the reason for the error
+// and the timestamp when it happened. This information is used to assess the model's availability.
+type record struct {
+	reason    string


Nit: having reason as a string is helpful for the developer, but probably is just more overhead than is useful. Not that we need to worry about being ultra-efficient here, but if we just had this as statusCode int and just kept track of whatever the "non-2xx" HTTP response status code was, that'd be sufficient IMHO. But this certainly will make it easier to diagnose.

Changed to use HTTP status codes as reasons in https://github.com/sourcegraph/sourcegraph/pull/64133/commits/c1c1db4cd9865c06998fca286634696a423723e7.

just kept track of whatever the "non-2xx" HTTP response status code was

We consider the a model unavailable if requests to it fail because of a context.DeadlineExceeded error or result in responses with 429 status codes. We ignore other non-2xx status codes. Do you think we should track all non-2xx responses?

WDYT, @chrsmith?

I've left a comment elsewhere on the PR. I could be convinced either way how to approach this.

chrsmith · 2024-07-30T17:31:37Z

+}
+
+// isModelAvailable checks if a model is available based on the number of failures within the specified timeout period.
+// Returns false if there is at least one failure within the timeout period. Otherwise, returns true.


⚠️ Is this the behavior that we want? I guess I misunderstood the exact way you were thinking about enforcing the circuit breaker. But I think we should expect some non-zero amount of "random, transient bullshit" from the upstream LLM provider. (e.g. a network cable gets cut, or some random hardware outage on their side.)

So having Cody Gateway consider a model unavailable with as few as a single error seems too sensitive. I thought we would instead require failureThreshold errors within timeout seconds. Does that seem like a better approach, or am I missing something?

Is this the behavior that we want?

Same as https://github.com/sourcegraph/sourcegraph/pull/64133#discussion_r1698293305.
Thanks for taking another look at the PR, I'm happy that we have this discussion on the PR, but not as a post-mortem 😅

we would instead require failureThreshold errors within timeout seconds

I don't think it'll work. Let's assume the failureThreshold is 10.
If within the timeout period we get 10,000 requests and only record failureThreshold number of errors, the model should still be considered available (as only 0.001% of requests indicate an issue). However, if within the timeout period we get 12 requests and 10 of them result in a 429 error, the model should be considered temporarily unavailable.

So, maybe we should implement the "percentage of all requests within the time window" approach you suggested in https://github.com/sourcegraph/sourcegraph/pull/64133#discussion_r1695504899?

implement the "percentage of all requests within the time window" approach

I’ve started exploring the implementation of X percent of failing requests out of the total requests within a Y-minute timeframe. I checked the Cody Gateway logs and noticed that it receives over 3k requests per minute to /v1/completions/fireworks. Over 5 minutes (the current timeout in the PR), this would obviously exceed 15k requests. I’m concerned that storing this many records in memory could be excessive, and iterating through the slice to determine model availability for each request might be even more critical performance wise.

…-tracker

taras-yemets · 2024-08-01T14:13:06Z

@chrsmith, following our conversation in https://github.com/sourcegraph/sourcegraph/pull/64133#discussion_r1697333723, I’ve decided to change our approach. We will now track the X percentage of failures within the last Y requests during the Z evaluation window. I believe this method is more effective than the following alternatives:

Tracking X consecutive failures within a timeframe (very fragile, as you correctly pointed out).
Tracking X failures within a timeframe (can lead to incorrect conclusions about model availability; for example, 10 failures out of 10 requests indicate a non-functional model, while 10 out of 1000 requests suggest the model is mostly available).
Tracking the percentage of failures out of all requests in a timeframe (our current method stores all requests in memory, potentially causing performance issues with a large number of requests).

I’ve also added some tests to confirm that the code works as expected. Could you please review the PR again?

If you agree with this approach, what do you think are the suitable values for X, Y, and Z? Currently, it's set to a 0.95 failure ratio out of 100 last requests within the last minute.

chrsmith

Some minor nitpicks, but this approach seems sound. Although needing to take a lock on every request isn't great... but at least it is specific to the model.

I'm sure we could spend more time on this and come up with a better or more sophisticated approach, but this seems practical without having any obvious flaws in the approach.

This PR suggests considering a model unavailable if 95% of the last 100 requests within the past minute either time out or return a 429 status code. I am not sure about these exact values and suggest them as a starting point for discussion

Tuning parameters like these is hard. I'd suggest just having a big obnoxiously long comment where we define them to whoever goes back to update them figure out the right choice.

I think your values for X, Y, and Z are reasonable. Here are some of my own thoughts:

IMHO 95% seems not great, but you should keep that value as-is. If we know that more than 50% of requests to a model are going to fail, we shouldn't be using it... but in reality, the client will just retry using that same model anyways. So it doesn't really benefit users for us to have a more sensitive value here. (But if we had fancier logic around using different models instead, or a better client-side experience when seeing disabled models, I can see us wanting to lower this threshold.)
The last 100 requests. Like you pointed out, this is a pretty low sample size, but it doesn't make sense to devote a ton of memory anyways. (And also we'll be supporting O(dozens) of LLM models as well.) So I think this seems reasonable.
One minute seems reasonable, but maybe too short? I suspect we'll really just be constrained by the number of records we store. So we'll probably only have "super recent" requests anyways. (e.g. we are receiving thousands of requests a minute, so we probably would be overwriting older records rather than having them "age out".)

chrsmith · 2024-08-02T19:06:51Z

+}
+
+func (mlt *modelsLoadTracker) getOrCreateCircuitBreakerForModel(model string) *modelCircuitBreaker {
+	mlt.mu.Lock()


Taking a lock every time we need to load a model circuit breaker isn't great. As it essentially means we can never do so in parallel.

If we are using a sync.RWMutext, consider using only RLock() or read-only lock. That will be the 99% case where we only want to get an existing value. (And there can be any number of concurrent reads.)

And if and only if the model isn't found, then we release the read-lock, and wait for a write-lock (r.Lock()) instead.

HOWEVER... rather than manually do all of this, we should probably just use the built in sync.Map type. Which exists to provide a safe-for-concurrency Map type, which is exactly what we are trying to do here.

https://pkg.go.dev/sync#Map

And then we'd just always call the Load method, and if the key wasn't found then call LoadOrStore. (We don't want to always call LoadOrStore, because that would require us to allocate the memory for adding a new ModelCircuitBreaker every type. When that's the var less common case.)

Anyways, in case you are curious, if you look at the implementation is kinda sorta what I was suggesting about only taking a read-lock, and then if the key isn't present, then taking a write-lock to add it. (Read: under the hood, it still uses a RWMutex. It's just that the sync.Map does some other things to make the common case faster.)
https://cs.opensource.google/go/go/+/refs/tags/go1.22.5:src/sync/map.go;l=203

chrsmith · 2024-08-02T19:15:52Z

+// record represents an individual error occurrence with details about the reason for the error
+// and the timestamp when it happened. This information is used to assess the model's availability.
+type record struct {
+	reason    string


I've left a comment elsewhere on the PR. I could be convinced either way how to approach this.

chrsmith · 2024-08-02T19:22:04Z

+			// Check if the record is a failure
+			if r.statusCode == http.StatusTooManyRequests || r.statusCode == http.StatusGatewayTimeout {


I think we need to be very careful here. Consider:

Having the if x == a || x == b || x == c { can get tedious. Using a switch statement can in some cases make this easier to read.

Since we are using the status code of '0' to denote that there was some failure even getting a response from upstream, we might want to consider those as failures, too?

Including the upstream LLM provider giving us a 500 seems like a good failure to be aware of as well.

switch r.statusCode { case http.StatusTooManyRequests, http.StatusGatewayTimeout, 0 /* no response obj */, http.StatusInternalServerError: failedCount++ }

I think this is the right approach, instead of say only considering 2xx requests as "non-failures". Because a user could have a misconfigured client, or just be sending bad requests resulting in a 4xx. And we shouldn't let errors on the client-side "poison" a model for all of Cody Gateway.

So being explicit and only considering a handful of HTTP responses as "failures that would count towards temporarily disabling the model" seems like a good choice.

Refactored to use a switch statement. I didn’t include the 0 case, as it also covers request errors like context.Canceled, which are unlikely to be upstream provider errors.

…-tracker

…h/sourcegraph into ty/model-availability-tracker

wip

42417dc

cla-bot Bot added the cla-signed label Jul 29, 2024

taras-yemets changed the title ~~wip~~ feat(cody): add circuit-breaker in Cody Gateway to handle overloaded models and ensure client requests fail fast Jul 29, 2024

chrsmith reviewed Jul 29, 2024

View reviewed changes

taras-yemets added 7 commits July 30, 2024 12:29

address feedback

b5d8ca2

address comments

c0d4a4a

remove debug statement

3ea14f0

rename and bazel configure

6080a04

upd comment

dcab231

upd comment

a2457c1

fix status code

4f22c3e

chrsmith approved these changes Jul 30, 2024

View reviewed changes

taras-yemets added 8 commits July 31, 2024 12:48

Merge remote-tracking branch 'origin/main' into ty/model-availability…

a1c741b

…-tracker

use http status codes as reasons

c1c1db4

rename variable

a7cbb00

Merge branch 'main' into ty/model-availability-tracker

389e99d

rm problematic file

707a022

refactor and add test

9fab02d

bazel configure

a18e29d

Merge remote-tracking branch 'origin/main' into ty/model-availability…

7c9b481

…-tracker

taras-yemets changed the title ~~feat(cody): add circuit-breaker in Cody Gateway to handle overloaded models and ensure client requests fail fast~~ feat(cody): add circuit breaker to handle timed-out requests and rate limit hits Aug 1, 2024

taras-yemets requested a review from chrsmith August 1, 2024 15:17

taras-yemets marked this pull request as ready for review August 1, 2024 15:17

Merge branch 'main' into ty/model-availability-tracker

2bb15fc

chrsmith approved these changes Aug 2, 2024

View reviewed changes

taras-yemets added 4 commits August 5, 2024 10:46

Merge remote-tracking branch 'origin/main' into ty/model-availability…

ce76fe5

…-tracker

improve comment

defbb50

Merge branch 'ty/model-availability-tracker' of github.com:sourcegrap…

82ded65

…h/sourcegraph into ty/model-availability-tracker

refactor to use switch

532d556

taras-yemets added 6 commits August 5, 2024 11:55

refactor to use circuit breaker by model as sync.Map

5f0ee91

Merge branch 'main' into ty/model-availability-tracker

410a984

fix test

dd7314c

Merge branch 'ty/model-availability-tracker' of github.com:sourcegrap…

315b990

…h/sourcegraph into ty/model-availability-tracker

upd comment

a288358

Merge branch 'main' into ty/model-availability-tracker

07505ab

taras-yemets enabled auto-merge (squash) August 5, 2024 11:29

taras-yemets merged commit d19aa10 into main Aug 5, 2024

taras-yemets deleted the ty/model-availability-tracker branch August 5, 2024 11:38

		// Check if the record is a failure
		if r.statusCode == http.StatusTooManyRequests \|\| r.statusCode == http.StatusGatewayTimeout {

Conversation

taras-yemets commented Jul 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test plan

Changelog

Uh oh!

chrsmith left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chrsmith Jul 31, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chrsmith left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

taras-yemets commented Aug 1, 2024

Uh oh!

chrsmith left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

taras-yemets commented Jul 29, 2024 •

edited

Loading

chrsmith Jul 31, 2024 •

edited

Loading