Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

feat(cody): add circuit breaker to handle timed-out requests and rate limit hits#64133

Merged
taras-yemets merged 27 commits into
mainfrom
ty/model-availability-tracker
Aug 5, 2024
Merged

feat(cody): add circuit breaker to handle timed-out requests and rate limit hits#64133
taras-yemets merged 27 commits into
mainfrom
ty/model-availability-tracker

Conversation

@taras-yemets

@taras-yemets taras-yemets commented Jul 29, 2024

Copy link
Copy Markdown
Contributor

Closes https://linear.app/sourcegraph/issue/CODY-2758/[autocomplete-latency]-add-circuit-breaker-in-cody-gateway-to-handle

This PR introduces an in-memory model availability tracker to ensure we do not send consequent requests to the currently unavailable LLM providers.

The tracker maintains a history of error records for each model. It evaluates these records to determine whether the model is available for new requests. The evaluation follows these steps:

  1. For every request to an upstream provider, the tracker records any errors that occur. Specifically, it logs timeout errors (when a request exceeds its deadline) and responses with a 429 status code (Too Many Requests).
  2. These error records are stored in a circular buffer for each model. This buffer holds a fixed number of records, ensuring efficient memory usage.
  3. The tracker calculates the failure ratio by analyzing the stored records. It checks the percentage of errors within a specified evaluation window and compares this against the total number of recent requests.
  4. Based on the calculated failure ratio, the tracker decides whether the model is available:
    - Model Unavailable: If the ratio of failures (timeouts or 429 status codes) exceeds a predefined threshold (X%), the model is marked as unavailable. In this state, the system does not send new requests to the upstream provider.
    When a model is unavailable, the system immediately returns an error status code, typically a 503 Service Unavailable, to the client. This informs the client that the service is temporarily unavailable due to upstream issues.
    - Model Available: If the failure ratio is within acceptable limits, the system proceeds with sending the request to the upstream provider.

This PR suggests considering a model unavailable if 95% of the last 100 requests within the past minute either time out or return a 429 status code. I am not sure about these exact values and suggest them as a starting point for discussion

Test plan

  • Added unit tests
  • CI

Changelog

@cla-bot cla-bot Bot added the cla-signed label Jul 29, 2024
@taras-yemets taras-yemets changed the title wip feat(cody): add circuit-breaker in Cody Gateway to handle overloaded models and ensure client requests fail fast Jul 29, 2024

@chrsmith chrsmith left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definitely seems like it's a good approach, and on the right track.

My only substantive feedback is to consider revisiting the data structure you are using for keeping track of the errors so that it can take up less space and/or be more efficient. (Since continuously copying/resizing slices isn't great.)

e.g. consider if we instead just made these all fixed-sizes using a Golang fixed-size array instead of a slice and/or using a proper circular buffer.

One thing that does feel a little off though, is that this is inherently going to run into data races. As multiple concurrent requests go to report their results. We need to use a mutex or channel to ensure that we have one writer at a time. (Which could lead to increasing latency due to contention.) Or use some sort of lock-free data structure, and/or more fancy approach.

But we probably should just use a mutex since we only would write the record in the event of the upstream request failing, right? So for the common-case, when we get a 200, there won't be any overhead at all. Right?

// errorThresholdForUnavailability is the maximum number of error records
// allowed within the modelAvailabilityCheckWindow before a model is
// considered unavailable.
const errorThresholdForUnavailability = 10

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an OK starting point. But this is one of those things that doesn't scale very well.

e.g. imagine we get 100x the number of requests, because of generic growth. Then, even if the transient error rate is 0.0001%, eventually you'd hit this static threshold.

So again, this is fine for now. But just keep in the back of your mind that we'll want to make this a bit more robust in the future. Perhaps by checking a "percentage of all requests within that time window", etc.

Comment thread cmd/cody-gateway/internal/httpapi/completions/recorder.go Outdated

func newModelsLoadTracker() *modelsLoadTracker {
return &modelsLoadTracker{
records: map[string][]*record{},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor-nit: This is totally OK. But I'm guessing that we have some idea of how many records we want to store. So it would be a tiny bit more efficient to pre-allocate the space. e.g. make(map[string][]*record{}, 1024).

Note that you only really benefit from that if the map will grow considerably (e.g. contain thousands of elements. If we only expect ~dozens or ~hundreds of elements, then this would fall into the 🤷 not worth it camp.)

Comment thread cmd/cody-gateway/internal/httpapi/completions/recorder.go Outdated
}

records := mlt.records[gatewayModel]
records = append([]*record{r}, records...)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ this is really bad from an efficiency perspective.

  • You are creating a new slice with a single element. ([]*record{r})
  • You are then appending records to the end of it. (Which will need to resize the newly created slice, and copy over all the elements.)

Instead, you should just write this as:

records = append(records, r)

This will keep the data in the records slice as-is, and only copy the new element over. If the underlying slice didn't have enough space, only then would it "copy all the elements into a new slice".

Does that all make sense?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed when implementing suggestion from https://github.com/sourcegraph/sourcegraph/pull/64133#discussion_r1695520961 by creating a slice with a predefined length and assigning new values by index instead of appending. Is this approach sustainable?

https://github.com/sourcegraph/sourcegraph/blob/6080a047e634cbf76d755bfd44974795423105a1/cmd/cody-gateway/internal/httpapi/completions/modelsloadtracker.go#L82-L96

@chrsmith chrsmith Jul 31, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by creating a slice with a predefined length and assigning new values by index instead of appending. Is this approach sustainable?

Yup, SGTM. I think it would be arguably, slightly cleaner to use a fixed-width array. (e.g. the type of records being [20]record.) But that's more of a matter of style.

The key thing is that once we introduce the modelCircuitBreaker, that we don't do a bunch of unnecessary copying/resizing of the slice. (Which we avoid by using that headIndex and just overwriting previous failures.)

Comment thread cmd/cody-gateway/internal/httpapi/completions/recorder.go Outdated
Comment thread cmd/cody-gateway/internal/httpapi/completions/recorder.go Outdated
Comment thread cmd/cody-gateway/internal/httpapi/completions/upstream.go
Comment thread cmd/cody-gateway/internal/httpapi/completions/upstream.go Outdated
Comment thread cmd/cody-gateway/internal/httpapi/completions/upstream.go Outdated

@chrsmith chrsmith left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of nitpicks, but this code generally looks good to me. However, I left a couple of important comments with a ⚠️ since I'm not entirely sure we want to land the logic that you have here. (But if I am mistaken, or this is intentional, go for it.)

Also, this seems like a situation where some unit tests would be appreciated. (Hitting the sweet spot where the code is sophisticated enough that they would be helpful to confirm things work; but the code is also simple enough that writing the tests wouldn't be a major PITA 😅 )

// pattern to temporarily mark models as unavailable if they exceed the allowed number
// of errors within the defined time window.
type modelsLoadTracker struct {
mu sync.RWMutex

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider moving the mutex down into the modelCircuitBreaker. That way we can record failures for two different models simultaneously. It's just a way of further reducing the possibility of contention on the lock.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the updated updated implementation (https://github.com/sourcegraph/sourcegraph/pull/64133#issuecomment-2263187863) we have two mutexes: one for modelsLoadTracker (to lock on modelCircuitBreaker creation) and one on modelCircuitBreaker to lock on adding record and calculating the failure ratio. Does this approach sound reasonable to you?


// failureThreshold represents the maximum number of error records
// allowed within the timeout before a model is considered unavailable.
failureThreshold int

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: What do you think about maxFailures or even maxAllowedFailures as a name? failureThreshold is definitely OK. But max-* might seem like a better fit? WDYT?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the updated implementation (https://github.com/sourcegraph/sourcegraph/pull/64133#issuecomment-2263187863) I renamed this to failureRatio, WDYT?

Comment thread cmd/cody-gateway/internal/httpapi/completions/modelsloadtracker.go Outdated
Comment thread cmd/cody-gateway/internal/httpapi/completions/modelsloadtracker.go Outdated
// record represents an individual error occurrence with details about the reason for the error
// and the timestamp when it happened. This information is used to assess the model's availability.
type record struct {
reason string

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: having reason as a string is helpful for the developer, but probably is just more overhead than is useful. Not that we need to worry about being ultra-efficient here, but if we just had this as statusCode int and just kept track of whatever the "non-2xx" HTTP response status code was, that'd be sufficient IMHO. But this certainly will make it easier to diagnose.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to use HTTP status codes as reasons in https://github.com/sourcegraph/sourcegraph/pull/64133/commits/c1c1db4cd9865c06998fca286634696a423723e7.

just kept track of whatever the "non-2xx" HTTP response status code was

We consider the a model unavailable if requests to it fail because of a context.DeadlineExceeded error or result in responses with 429 status codes. We ignore other non-2xx status codes. Do you think we should track all non-2xx responses?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT, @chrsmith?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a comment elsewhere on the PR. I could be convinced either way how to approach this.

Comment thread cmd/cody-gateway/internal/httpapi/completions/modelsloadtracker.go Outdated
Comment thread cmd/cody-gateway/internal/httpapi/completions/modelsloadtracker.go Outdated
Comment thread cmd/cody-gateway/internal/httpapi/completions/modelsloadtracker.go Outdated
}

// isModelAvailable checks if a model is available based on the number of failures within the specified timeout period.
// Returns false if there is at least one failure within the timeout period. Otherwise, returns true.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Is this the behavior that we want? I guess I misunderstood the exact way you were thinking about enforcing the circuit breaker. But I think we should expect some non-zero amount of "random, transient bullshit" from the upstream LLM provider. (e.g. a network cable gets cut, or some random hardware outage on their side.)

So having Cody Gateway consider a model unavailable with as few as a single error seems too sensitive. I thought we would instead require failureThreshold errors within timeout seconds. Does that seem like a better approach, or am I missing something?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the behavior that we want?

Same as https://github.com/sourcegraph/sourcegraph/pull/64133#discussion_r1698293305.
Thanks for taking another look at the PR, I'm happy that we have this discussion on the PR, but not as a post-mortem 😅

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we would instead require failureThreshold errors within timeout seconds

I don't think it'll work. Let's assume the failureThreshold is 10.
If within the timeout period we get 10,000 requests and only record failureThreshold number of errors, the model should still be considered available (as only 0.001% of requests indicate an issue). However, if within the timeout period we get 12 requests and 10 of them result in a 429 error, the model should be considered temporarily unavailable.

So, maybe we should implement the "percentage of all requests within the time window" approach you suggested in https://github.com/sourcegraph/sourcegraph/pull/64133#discussion_r1695504899?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implement the "percentage of all requests within the time window" approach

I’ve started exploring the implementation of X percent of failing requests out of the total requests within a Y-minute timeframe. I checked the Cody Gateway logs and noticed that it receives over 3k requests per minute to /v1/completions/fireworks. Over 5 minutes (the current timeout in the PR), this would obviously exceed 15k requests. I’m concerned that storing this many records in memory could be excessive, and iterating through the slice to determine model availability for each request might be even more critical performance wise.

Comment thread cmd/cody-gateway/internal/httpapi/completions/modelsloadtracker.go Outdated
@taras-yemets

Copy link
Copy Markdown
Contributor Author

@chrsmith, following our conversation in https://github.com/sourcegraph/sourcegraph/pull/64133#discussion_r1697333723, I’ve decided to change our approach. We will now track the X percentage of failures within the last Y requests during the Z evaluation window. I believe this method is more effective than the following alternatives:

  • Tracking X consecutive failures within a timeframe (very fragile, as you correctly pointed out).
  • Tracking X failures within a timeframe (can lead to incorrect conclusions about model availability; for example, 10 failures out of 10 requests indicate a non-functional model, while 10 out of 1000 requests suggest the model is mostly available).
  • Tracking the percentage of failures out of all requests in a timeframe (our current method stores all requests in memory, potentially causing performance issues with a large number of requests).

I’ve also added some tests to confirm that the code works as expected. Could you please review the PR again?

If you agree with this approach, what do you think are the suitable values for X, Y, and Z? Currently, it's set to a 0.95 failure ratio out of 100 last requests within the last minute.

@taras-yemets taras-yemets changed the title feat(cody): add circuit-breaker in Cody Gateway to handle overloaded models and ensure client requests fail fast feat(cody): add circuit breaker to handle timed-out requests and rate limit hits Aug 1, 2024
@taras-yemets taras-yemets requested a review from chrsmith August 1, 2024 15:17
@taras-yemets taras-yemets marked this pull request as ready for review August 1, 2024 15:17

@chrsmith chrsmith left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nitpicks, but this approach seems sound. Although needing to take a lock on every request isn't great... but at least it is specific to the model.

I'm sure we could spend more time on this and come up with a better or more sophisticated approach, but this seems practical without having any obvious flaws in the approach. :shipit:

This PR suggests considering a model unavailable if 95% of the last 100 requests within the past minute either time out or return a 429 status code. I am not sure about these exact values and suggest them as a starting point for discussion

Tuning parameters like these is hard. I'd suggest just having a big obnoxiously long comment where we define them to whoever goes back to update them figure out the right choice.

I think your values for X, Y, and Z are reasonable. Here are some of my own thoughts:

  • IMHO 95% seems not great, but you should keep that value as-is. If we know that more than 50% of requests to a model are going to fail, we shouldn't be using it... but in reality, the client will just retry using that same model anyways. So it doesn't really benefit users for us to have a more sensitive value here. (But if we had fancier logic around using different models instead, or a better client-side experience when seeing disabled models, I can see us wanting to lower this threshold.)

  • The last 100 requests. Like you pointed out, this is a pretty low sample size, but it doesn't make sense to devote a ton of memory anyways. (And also we'll be supporting O(dozens) of LLM models as well.) So I think this seems reasonable.

  • One minute seems reasonable, but maybe too short? I suspect we'll really just be constrained by the number of records we store. So we'll probably only have "super recent" requests anyways. (e.g. we are receiving thousands of requests a minute, so we probably would be overwriting older records rather than having them "age out".)

Comment thread cmd/cody-gateway/internal/httpapi/completions/loadtracker.go
Comment thread cmd/cody-gateway/internal/httpapi/completions/loadtracker.go Outdated
}

func (mlt *modelsLoadTracker) getOrCreateCircuitBreakerForModel(model string) *modelCircuitBreaker {
mlt.mu.Lock()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a lock every time we need to load a model circuit breaker isn't great. As it essentially means we can never do so in parallel.

If we are using a sync.RWMutext, consider using only RLock() or read-only lock. That will be the 99% case where we only want to get an existing value. (And there can be any number of concurrent reads.)

And if and only if the model isn't found, then we release the read-lock, and wait for a write-lock (r.Lock()) instead.

HOWEVER... rather than manually do all of this, we should probably just use the built in sync.Map type. Which exists to provide a safe-for-concurrency Map type, which is exactly what we are trying to do here.

https://pkg.go.dev/sync#Map

And then we'd just always call the Load method, and if the key wasn't found then call LoadOrStore. (We don't want to always call LoadOrStore, because that would require us to allocate the memory for adding a new ModelCircuitBreaker every type. When that's the var less common case.)

Anyways, in case you are curious, if you look at the implementation is kinda sorta what I was suggesting about only taking a read-lock, and then if the key isn't present, then taking a write-lock to add it. (Read: under the hood, it still uses a RWMutex. It's just that the sync.Map does some other things to make the common case faster.)
https://cs.opensource.google/go/go/+/refs/tags/go1.22.5:src/sync/map.go;l=203

// record represents an individual error occurrence with details about the reason for the error
// and the timestamp when it happened. This information is used to assess the model's availability.
type record struct {
reason string

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a comment elsewhere on the PR. I could be convinced either way how to approach this.

Comment on lines +138 to +139
// Check if the record is a failure
if r.statusCode == http.StatusTooManyRequests || r.statusCode == http.StatusGatewayTimeout {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to be very careful here. Consider:

  • Having the if x == a || x == b || x == c { can get tedious. Using a switch statement can in some cases make this easier to read.
  • Since we are using the status code of '0' to denote that there was some failure even getting a response from upstream, we might want to consider those as failures, too?
  • Including the upstream LLM provider giving us a 500 seems like a good failure to be aware of as well.
switch r.statusCode {
    case http.StatusTooManyRequests, http.StatusGatewayTimeout, 0 /* no response obj */, http.StatusInternalServerError:
        failedCount++
}

I think this is the right approach, instead of say only considering 2xx requests as "non-failures". Because a user could have a misconfigured client, or just be sending bad requests resulting in a 4xx. And we shouldn't let errors on the client-side "poison" a model for all of Cody Gateway.

So being explicit and only considering a handful of HTTP responses as "failures that would count towards temporarily disabling the model" seems like a good choice.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored to use a switch statement. I didn’t include the 0 case, as it also covers request errors like context.Canceled, which are unlikely to be upstream provider errors.

@taras-yemets taras-yemets enabled auto-merge (squash) August 5, 2024 11:29
@taras-yemets taras-yemets merged commit d19aa10 into main Aug 5, 2024
@taras-yemets taras-yemets deleted the ty/model-availability-tracker branch August 5, 2024 11:38
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants