Search: revert automatic retries#59603
Conversation
ggilmore
left a comment
There was a problem hiding this comment.
In general, Zoekt has special
handling for when replicas are unavailable, and we don't want to use this
general default.
I'm curious, what does this logic look like? (Has the logic that detects when a replica is unavailable been adapted for gRPC errors)? If not, that might be something you want to look into.
| } | ||
|
|
||
| func (a *automaticRetryClient) Search(ctx context.Context, in *proto.SearchRequest, opts ...grpc.CallOption) (*proto.SearchResponse, error) { | ||
| opts = append(defaults.RetryPolicy, opts...) |
There was a problem hiding this comment.
Instead of simply deleting the automaticRetryClient, I think that it might be better to:
-
"disable" automatic retries by making each of these methods a pass-through method to the underlying client (not adding in the defaults.RetryPolicy call options)
func (a *automaticRetryClient) Search(ctx context.Context, in *proto.SearchRequest, opts ...grpc.CallOption) (*proto.SearchResponse, error) { return a.base.Search(ctx, in, opts...) }
You can see how I do this for some gitserver methods here: https://github.com/sourcegraph/sourcegraph/blob/95a44f053d003699dc6dcbca999671c3e9638307/internal/gitserver/retry.go#L22-L39
-
Document for each method why we aren't using the default retry policy. In this case, it's not because the methods aren't idempotent, but instead because there is "higher level" retry policy in the application.
You can see a similar comment that Ieft for gitserver's CreateCommitFromPatchBinary method to explain why I wasn't using the automatic retry support with it: https://github.com/sourcegraph/sourcegraph/blob/95a44f053d003699dc6dcbca999671c3e9638307/internal/gitserver/retry.go#L35-L39
func (r *automaticRetryClient) CreateCommitFromPatchBinary(ctx context.Context, opts ...grpc.CallOption) (proto.GitserverService_CreateCommitFromPatchBinaryClient, error) { // CreateCommitFromPatchBinary isn't idempotent. It also is a client-streaming method, which is currently unsupported by our automatic retry logic. // The caller is responsible for implementing their own retry semantics for this method. return r.base.CreateCommitFromPatchBinary(ctx, opts...) }
I believe this approach is more discoverable since it uses the same standardized pattern that other clients use. It's also a natural place that someone would look for documentation about retries.
There was a problem hiding this comment.
Thanks for the detailed review! For this PR, I would like to completely remove the retry behavior when connecting to Zoekt, as opposed to doing it only for some methods. This matches the old behavior most closely, and will let us test this as a potential source of the instability/ latency spikes we've seen on dot com. As a follow-up, we can look into re-introducing retries for some methods, for example List.
Since we're totally eliminating retries, I think it makes the most sense to not keep wrapping things in a "retry client". However, I'll definitely add a comment here highlighting that Zoekt is taking a non-default approach!
During a search, we check whether we can ignore an error and return partial results: https://github.com/sourcegraph/sourcegraph/blob/1e1d1db66424327e51fbcd014d201e1f6539b13e/internal/search/backend/horizontal.go#L175-L178 Good question, I'm not sure if these error checks are still valid for gRPC... will follow up on this. |
|
Following up: I filed https://github.com/sourcegraph/sourcegraph/issues/59910 so we don't forget to dig into this. I wasn't able to confirm the error-handling logic is actually still working. |
This reverts #59133, which added automatic retries for the gRPC client that
Zoekt uses.
As noted in https://github.com/sourcegraph/sourcegraph/pull/59133#issuecomment-1867109359, this is a change in behavior from before. We
suspect that when Zoekt is overloaded, it can appear as "unavailable", which
causes us to automatically retry searches. This in turn puts more load on
Zoekt, keeping it in an overloaded state. In general, Zoekt has special
handling for when replicas are unavailable, and we don't want to use this
general default.
Test plan
Covered by existing tests