Search: revert automatic retries by jtibshirani · Pull Request #59603 · sourcegraph/sourcegraph-public-snapshot

jtibshirani · 2024-01-15T19:40:35Z

This reverts #59133, which added automatic retries for the gRPC client that
Zoekt uses.

As noted in https://github.com/sourcegraph/sourcegraph/pull/59133#issuecomment-1867109359, this is a change in behavior from before. We
suspect that when Zoekt is overloaded, it can appear as "unavailable", which
causes us to automatically retry searches. This in turn puts more load on
Zoekt, keeping it in an overloaded state. In general, Zoekt has special
handling for when replicas are unavailable, and we don't want to use this
general default.

Test plan

Covered by existing tests

ggilmore

In general, Zoekt has special
handling for when replicas are unavailable, and we don't want to use this
general default.

I'm curious, what does this logic look like? (Has the logic that detects when a replica is unavailable been adapted for gRPC errors)? If not, that might be something you want to look into.

ggilmore · 2024-01-15T21:56:50Z

-}
-
-func (a *automaticRetryClient) Search(ctx context.Context, in *proto.SearchRequest, opts ...grpc.CallOption) (*proto.SearchResponse, error) {
-	opts = append(defaults.RetryPolicy, opts...)


Instead of simply deleting the automaticRetryClient, I think that it might be better to:

"disable" automatic retries by making each of these methods a pass-through method to the underlying client (not adding in the defaults.RetryPolicy call options)

func (a *automaticRetryClient) Search(ctx context.Context, in *proto.SearchRequest, opts ...grpc.CallOption) (*proto.SearchResponse, error) {
	return a.base.Search(ctx, in, opts...)
}

You can see how I do this for some gitserver methods here: https://github.com/sourcegraph/sourcegraph/blob/95a44f053d003699dc6dcbca999671c3e9638307/internal/gitserver/retry.go#L22-L39

Document for each method why we aren't using the default retry policy. In this case, it's not because the methods aren't idempotent, but instead because there is a "higher level" retry policy in the application.

You can see a similar comment that I left for gitserver's CreateCommitFromPatchBinary method to explain why I wasn't using the automatic retry support with it: https://github.com/sourcegraph/sourcegraph/blob/95a44f053d003699dc6dcbca999671c3e9638307/internal/gitserver/retry.go#L35-L39

func (r *automaticRetryClient) CreateCommitFromPatchBinary(ctx context.Context, opts ...grpc.CallOption) (proto.GitserverService_CreateCommitFromPatchBinaryClient, error) {
	// CreateCommitFromPatchBinary isn't idempotent. It also is a client-streaming method, which is currently unsupported by our automatic retry logic.
	// The caller is responsible for implementing their own retry semantics for this method.
	return r.base.CreateCommitFromPatchBinary(ctx, opts...)
}

I believe this approach is more discoverable since it uses the same standardized pattern that other clients use. It's also a natural place that someone would look for documentation about retries.

jtibshirani

Thanks for the detailed review! For this PR, I would like to completely remove the retry behavior when connecting to Zoekt, as opposed to doing it only for some methods. This matches the old behavior most closely, and will let us test this as a potential source of the instability/latency spikes we've seen on dot com. As a follow-up, we can look into re-introducing retries for some methods, for example List.

Since we're totally eliminating retries, I think it makes the most sense to not keep wrapping things in a "retry client". However, I'll definitely add a comment here highlighting that Zoekt is taking a non-default approach!

jtibshirani · 2024-01-15T23:01:03Z

I'm curious, what does this logic look like? (Has the logic that detects when a replica is unavailable been adapted for gRPC errors)? If not, that might be something you want to look into.

During a search, we check whether we can ignore an error and return partial results: https://github.com/sourcegraph/sourcegraph/blob/1e1d1db66424327e51fbcd014d201e1f6539b13e/internal/search/backend/horizontal.go#L175-L178

Good question, I'm not sure if these error checks are still valid for gRPC... will follow up on this.

jtibshirani · 2024-01-26T18:27:29Z

Following up: I filed https://github.com/sourcegraph/sourcegraph/issues/59910 so we don't forget to dig into this. I wasn't able to confirm the error-handling logic is actually still working.

Search: revert automatic retries

03b9bc0

cla-bot Bot added the cla-signed label Jan 15, 2024

jtibshirani requested review from a team January 15, 2024 19:48

ggilmore approved these changes Jan 15, 2024

Add comment about automaticRetryClient

7417e28

stefanhengl approved these changes Jan 16, 2024

keegancsmith approved these changes Jan 16, 2024

jtibshirani merged commit 914bb54 into main Jan 16, 2024

jtibshirani deleted the jtibs/retry branch January 16, 2024 16:54

jtibshirani merged 2 commits into main from jtibs/retry
