feat(cody): add circuit breaker to handle timed-out requests and rate limit hits#64133
Conversation
chrsmith
left a comment
There was a problem hiding this comment.
This definitely seems like it's a good approach, and on the right track.
My only substantive feedback is to consider revisiting the data structure you are using for keeping track of the errors so that it can take up less space and/or be more efficient. (Since continuously copying/resizing slices isn't great.)
e.g. consider if we instead just made these all fixed-sizes using a Golang fixed-size array instead of a slice and/or using a proper circular buffer.
One thing that does feel a little off though, is that this is inherently going to run into data races. As multiple concurrent requests go to report their results. We need to use a mutex or channel to ensure that we have one writer at a time. (Which could lead to increasing latency due to contention.) Or use some sort of lock-free data structure, and/or more fancy approach.
But we probably should just use a mutex since we only would write the record in the event of the upstream request failing, right? So for the common-case, when we get a 200, there won't be any overhead at all. Right?
| // errorThresholdForUnavailability is the maximum number of error records | ||
| // allowed within the modelAvailabilityCheckWindow before a model is | ||
| // considered unavailable. | ||
| const errorThresholdForUnavailability = 10 |
There was a problem hiding this comment.
This is an OK starting point. But this is one of those things that doesn't scale very well.
e.g. imagine we get 100x the number of requests, because of generic growth. Then, even if the transient error rate is 0.0001%, eventually you'd hit this static threshold.
So again, this is fine for now. But just keep in the back of your mind that we'll want to make this a bit more robust in the future. Perhaps by checking a "percentage of all requests within that time window", etc.
|
|
||
| func newModelsLoadTracker() *modelsLoadTracker { | ||
| return &modelsLoadTracker{ | ||
| records: map[string][]*record{}, |
There was a problem hiding this comment.
Minor-nit: This is totally OK. But I'm guessing that we have some idea of how many records we want to store. So it would be a tiny bit more efficient to pre-allocate the space. e.g. make(map[string][]*record{}, 1024).
Note that you only really benefit from that if the map will grow considerably (e.g. contain thousands of elements. If we only expect ~dozens or ~hundreds of elements, then this would fall into the 🤷 not worth it camp.)
| } | ||
|
|
||
| records := mlt.records[gatewayModel] | ||
| records = append([]*record{r}, records...) |
There was a problem hiding this comment.
- You are creating a new slice with a single element. (
[]*record{r}) - You are then appending
recordsto the end of it. (Which will need to resize the newly created slice, and copy over all the elements.)
Instead, you should just write this as:
records = append(records, r)This will keep the data in the records slice as-is, and only copy the new element over. If the underlying slice didn't have enough space, only then would it "copy all the elements into a new slice".
Does that all make sense?
There was a problem hiding this comment.
Addressed when implementing suggestion from https://github.com/sourcegraph/sourcegraph/pull/64133#discussion_r1695520961 by creating a slice with a predefined length and assigning new values by index instead of appending. Is this approach sustainable?
There was a problem hiding this comment.
by creating a slice with a predefined length and assigning new values by index instead of appending. Is this approach sustainable?
Yup, SGTM. I think it would be arguably, slightly cleaner to use a fixed-width array. (e.g. the type of records being [20]record.) But that's more of a matter of style.
The key thing is that once we introduce the modelCircuitBreaker, that we don't do a bunch of unnecessary copying/resizing of the slice. (Which we avoid by using that headIndex and just overwriting previous failures.)
chrsmith
left a comment
There was a problem hiding this comment.
Lots of nitpicks, but this code generally looks good to me. However, I left a couple of important comments with a
Also, this seems like a situation where some unit tests would be appreciated. (Hitting the sweet spot where the code is sophisticated enough that they would be helpful to confirm things work; but the code is also simple enough that writing the tests wouldn't be a major PITA 😅 )
| // pattern to temporarily mark models as unavailable if they exceed the allowed number | ||
| // of errors within the defined time window. | ||
| type modelsLoadTracker struct { | ||
| mu sync.RWMutex |
There was a problem hiding this comment.
Consider moving the mutex down into the modelCircuitBreaker. That way we can record failures for two different models simultaneously. It's just a way of further reducing the possibility of contention on the lock.
There was a problem hiding this comment.
With the updated updated implementation (https://github.com/sourcegraph/sourcegraph/pull/64133#issuecomment-2263187863) we have two mutexes: one for modelsLoadTracker (to lock on modelCircuitBreaker creation) and one on modelCircuitBreaker to lock on adding record and calculating the failure ratio. Does this approach sound reasonable to you?
|
|
||
| // failureThreshold represents the maximum number of error records | ||
| // allowed within the timeout before a model is considered unavailable. | ||
| failureThreshold int |
There was a problem hiding this comment.
Nit: What do you think about maxFailures or even maxAllowedFailures as a name? failureThreshold is definitely OK. But max-* might seem like a better fit? WDYT?
There was a problem hiding this comment.
With the updated implementation (https://github.com/sourcegraph/sourcegraph/pull/64133#issuecomment-2263187863) I renamed this to failureRatio, WDYT?
| // record represents an individual error occurrence with details about the reason for the error | ||
| // and the timestamp when it happened. This information is used to assess the model's availability. | ||
| type record struct { | ||
| reason string |
There was a problem hiding this comment.
Nit: having reason as a string is helpful for the developer, but probably is just more overhead than is useful. Not that we need to worry about being ultra-efficient here, but if we just had this as statusCode int and just kept track of whatever the "non-2xx" HTTP response status code was, that'd be sufficient IMHO. But this certainly will make it easier to diagnose.
There was a problem hiding this comment.
Changed to use HTTP status codes as reasons in https://github.com/sourcegraph/sourcegraph/pull/64133/commits/c1c1db4cd9865c06998fca286634696a423723e7.
just kept track of whatever the "non-2xx" HTTP response status code was
We consider the a model unavailable if requests to it fail because of a context.DeadlineExceeded error or result in responses with 429 status codes. We ignore other non-2xx status codes. Do you think we should track all non-2xx responses?
There was a problem hiding this comment.
I've left a comment elsewhere on the PR. I could be convinced either way how to approach this.
| } | ||
|
|
||
| // isModelAvailable checks if a model is available based on the number of failures within the specified timeout period. | ||
| // Returns false if there is at least one failure within the timeout period. Otherwise, returns true. |
There was a problem hiding this comment.
So having Cody Gateway consider a model unavailable with as few as a single error seems too sensitive. I thought we would instead require failureThreshold errors within timeout seconds. Does that seem like a better approach, or am I missing something?
There was a problem hiding this comment.
Is this the behavior that we want?
Same as https://github.com/sourcegraph/sourcegraph/pull/64133#discussion_r1698293305.
Thanks for taking another look at the PR, I'm happy that we have this discussion on the PR, but not as a post-mortem 😅
There was a problem hiding this comment.
we would instead require
failureThresholderrors within timeout seconds
I don't think it'll work. Let's assume the failureThreshold is 10.
If within the timeout period we get 10,000 requests and only record failureThreshold number of errors, the model should still be considered available (as only 0.001% of requests indicate an issue). However, if within the timeout period we get 12 requests and 10 of them result in a 429 error, the model should be considered temporarily unavailable.
So, maybe we should implement the "percentage of all requests within the time window" approach you suggested in https://github.com/sourcegraph/sourcegraph/pull/64133#discussion_r1695504899?
There was a problem hiding this comment.
implement the "percentage of all requests within the time window" approach
I’ve started exploring the implementation of X percent of failing requests out of the total requests within a Y-minute timeframe. I checked the Cody Gateway logs and noticed that it receives over 3k requests per minute to /v1/completions/fireworks. Over 5 minutes (the current timeout in the PR), this would obviously exceed 15k requests. I’m concerned that storing this many records in memory could be excessive, and iterating through the slice to determine model availability for each request might be even more critical performance wise.
|
@chrsmith, following our conversation in https://github.com/sourcegraph/sourcegraph/pull/64133#discussion_r1697333723, I’ve decided to change our approach. We will now track the X percentage of failures within the last Y requests during the Z evaluation window. I believe this method is more effective than the following alternatives:
I’ve also added some tests to confirm that the code works as expected. Could you please review the PR again? If you agree with this approach, what do you think are the suitable values for X, Y, and Z? Currently, it's set to a 0.95 failure ratio out of 100 last requests within the last minute. |
chrsmith
left a comment
There was a problem hiding this comment.
Some minor nitpicks, but this approach seems sound. Although needing to take a lock on every request isn't great... but at least it is specific to the model.
I'm sure we could spend more time on this and come up with a better or more sophisticated approach, but this seems practical without having any obvious flaws in the approach. ![]()
This PR suggests considering a model unavailable if 95% of the last 100 requests within the past minute either time out or return a 429 status code. I am not sure about these exact values and suggest them as a starting point for discussion
Tuning parameters like these is hard. I'd suggest just having a big obnoxiously long comment where we define them to whoever goes back to update them figure out the right choice.
I think your values for X, Y, and Z are reasonable. Here are some of my own thoughts:
-
IMHO 95% seems not great, but you should keep that value as-is. If we know that more than 50% of requests to a model are going to fail, we shouldn't be using it... but in reality, the client will just retry using that same model anyways. So it doesn't really benefit users for us to have a more sensitive value here. (But if we had fancier logic around using different models instead, or a better client-side experience when seeing disabled models, I can see us wanting to lower this threshold.)
-
The last 100 requests. Like you pointed out, this is a pretty low sample size, but it doesn't make sense to devote a ton of memory anyways. (And also we'll be supporting O(dozens) of LLM models as well.) So I think this seems reasonable.
-
One minute seems reasonable, but maybe too short? I suspect we'll really just be constrained by the number of records we store. So we'll probably only have "super recent" requests anyways. (e.g. we are receiving thousands of requests a minute, so we probably would be overwriting older records rather than having them "age out".)
| } | ||
|
|
||
| func (mlt *modelsLoadTracker) getOrCreateCircuitBreakerForModel(model string) *modelCircuitBreaker { | ||
| mlt.mu.Lock() |
There was a problem hiding this comment.
Taking a lock every time we need to load a model circuit breaker isn't great. As it essentially means we can never do so in parallel.
If we are using a sync.RWMutext, consider using only RLock() or read-only lock. That will be the 99% case where we only want to get an existing value. (And there can be any number of concurrent reads.)
And if and only if the model isn't found, then we release the read-lock, and wait for a write-lock (r.Lock()) instead.
HOWEVER... rather than manually do all of this, we should probably just use the built in sync.Map type. Which exists to provide a safe-for-concurrency Map type, which is exactly what we are trying to do here.
And then we'd just always call the Load method, and if the key wasn't found then call LoadOrStore. (We don't want to always call LoadOrStore, because that would require us to allocate the memory for adding a new ModelCircuitBreaker every type. When that's the var less common case.)
Anyways, in case you are curious, if you look at the implementation is kinda sorta what I was suggesting about only taking a read-lock, and then if the key isn't present, then taking a write-lock to add it. (Read: under the hood, it still uses a RWMutex. It's just that the sync.Map does some other things to make the common case faster.)
https://cs.opensource.google/go/go/+/refs/tags/go1.22.5:src/sync/map.go;l=203
| // record represents an individual error occurrence with details about the reason for the error | ||
| // and the timestamp when it happened. This information is used to assess the model's availability. | ||
| type record struct { | ||
| reason string |
There was a problem hiding this comment.
I've left a comment elsewhere on the PR. I could be convinced either way how to approach this.
| // Check if the record is a failure | ||
| if r.statusCode == http.StatusTooManyRequests || r.statusCode == http.StatusGatewayTimeout { |
There was a problem hiding this comment.
I think we need to be very careful here. Consider:
- Having the
if x == a || x == b || x == c {can get tedious. Using aswitchstatement can in some cases make this easier to read. - Since we are using the status code of '0' to denote that there was some failure even getting a response from upstream, we might want to consider those as failures, too?
- Including the upstream LLM provider giving us a 500 seems like a good failure to be aware of as well.
switch r.statusCode {
case http.StatusTooManyRequests, http.StatusGatewayTimeout, 0 /* no response obj */, http.StatusInternalServerError:
failedCount++
}I think this is the right approach, instead of say only considering 2xx requests as "non-failures". Because a user could have a misconfigured client, or just be sending bad requests resulting in a 4xx. And we shouldn't let errors on the client-side "poison" a model for all of Cody Gateway.
So being explicit and only considering a handful of HTTP responses as "failures that would count towards temporarily disabling the model" seems like a good choice.
There was a problem hiding this comment.
Refactored to use a switch statement. I didn’t include the 0 case, as it also covers request errors like context.Canceled, which are unlikely to be upstream provider errors.
…h/sourcegraph into ty/model-availability-tracker
…h/sourcegraph into ty/model-availability-tracker
Closes https://linear.app/sourcegraph/issue/CODY-2758/[autocomplete-latency]-add-circuit-breaker-in-cody-gateway-to-handle
This PR introduces an in-memory model availability tracker to ensure we do not send consequent requests to the currently unavailable LLM providers.
The tracker maintains a history of error records for each model. It evaluates these records to determine whether the model is available for new requests. The evaluation follows these steps:
- Model Unavailable: If the ratio of failures (timeouts or 429 status codes) exceeds a predefined threshold (X%), the model is marked as unavailable. In this state, the system does not send new requests to the upstream provider.
When a model is unavailable, the system immediately returns an error status code, typically a 503 Service Unavailable, to the client. This informs the client that the service is temporarily unavailable due to upstream issues.
- Model Available: If the failure ratio is within acceptable limits, the system proceeds with sending the request to the upstream provider.
This PR suggests considering a model unavailable if 95% of the last 100 requests within the past minute either time out or return a 429 status code. I am not sure about these exact values and suggest them as a starting point for discussion
Test plan
Changelog