Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Search: revert automatic retries#59603

Merged
jtibshirani merged 2 commits into
mainfrom
jtibs/retry
Jan 16, 2024
Merged

Search: revert automatic retries#59603
jtibshirani merged 2 commits into
mainfrom
jtibs/retry

Conversation

@jtibshirani

@jtibshirani jtibshirani commented Jan 15, 2024

Copy link
Copy Markdown
Contributor

This reverts #59133, which added automatic retries for the gRPC client that
Zoekt uses.

As noted in https://github.com/sourcegraph/sourcegraph/pull/59133#issuecomment-1867109359, this is a change in behavior from before. We
suspect that when Zoekt is overloaded, it can appear as "unavailable", which
causes us to automatically retry searches. This in turn puts more load on
Zoekt, keeping it in an overloaded state. In general, Zoekt has special
handling for when replicas are unavailable, and we don't want to use this
general default.

Test plan

Covered by existing tests

@cla-bot cla-bot Bot added the cla-signed label Jan 15, 2024
@jtibshirani jtibshirani requested review from a team January 15, 2024 19:48

@ggilmore ggilmore left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, Zoekt has special
handling for when replicas are unavailable, and we don't want to use this
general default.

I'm curious, what does this logic look like? (Has the logic that detects when a replica is unavailable been adapted for gRPC errors)? If not, that might be something you want to look into.

}

func (a *automaticRetryClient) Search(ctx context.Context, in *proto.SearchRequest, opts ...grpc.CallOption) (*proto.SearchResponse, error) {
opts = append(defaults.RetryPolicy, opts...)

@ggilmore ggilmore Jan 15, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of simply deleting the automaticRetryClient, I think that it might be better to:

  1. "disable" automatic retries by making each of these methods a pass-through method to the underlying client (not adding in the defaults.RetryPolicy call options)

    func (a *automaticRetryClient) Search(ctx context.Context, in *proto.SearchRequest, opts ...grpc.CallOption) (*proto.SearchResponse, error) {
    	return a.base.Search(ctx, in, opts...)
    }

    You can see how I do this for some gitserver methods here: https://github.com/sourcegraph/sourcegraph/blob/95a44f053d003699dc6dcbca999671c3e9638307/internal/gitserver/retry.go#L22-L39

  2. Document for each method why we aren't using the default retry policy. In this case, it's not because the methods aren't idempotent, but instead because there is "higher level" retry policy in the application.

    You can see a similar comment that Ieft for gitserver's CreateCommitFromPatchBinary method to explain why I wasn't using the automatic retry support with it: https://github.com/sourcegraph/sourcegraph/blob/95a44f053d003699dc6dcbca999671c3e9638307/internal/gitserver/retry.go#L35-L39

    func (r *automaticRetryClient) CreateCommitFromPatchBinary(ctx context.Context, opts ...grpc.CallOption) (proto.GitserverService_CreateCommitFromPatchBinaryClient, error) {
    	// CreateCommitFromPatchBinary isn't idempotent. It also is a client-streaming method, which is currently unsupported by our automatic retry logic.
    	// The caller is responsible for implementing their own retry semantics for this method.
    	return r.base.CreateCommitFromPatchBinary(ctx, opts...)
    }

I believe this approach is more discoverable since it uses the same standardized pattern that other clients use. It's also a natural place that someone would look for documentation about retries.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed review! For this PR, I would like to completely remove the retry behavior when connecting to Zoekt, as opposed to doing it only for some methods. This matches the old behavior most closely, and will let us test this as a potential source of the instability/ latency spikes we've seen on dot com. As a follow-up, we can look into re-introducing retries for some methods, for example List.

Since we're totally eliminating retries, I think it makes the most sense to not keep wrapping things in a "retry client". However, I'll definitely add a comment here highlighting that Zoekt is taking a non-default approach!

@jtibshirani

jtibshirani commented Jan 15, 2024

Copy link
Copy Markdown
Contributor Author

I'm curious, what does this logic look like? (Has the logic that detects when a replica is unavailable been adapted for gRPC errors)? If not, that might be something you want to look into.

During a search, we check whether we can ignore an error and return partial results: https://github.com/sourcegraph/sourcegraph/blob/1e1d1db66424327e51fbcd014d201e1f6539b13e/internal/search/backend/horizontal.go#L175-L178 Good question, I'm not sure if these error checks are still valid for gRPC... will follow up on this.

@jtibshirani jtibshirani merged commit 914bb54 into main Jan 16, 2024
@jtibshirani jtibshirani deleted the jtibs/retry branch January 16, 2024 16:54
@jtibshirani

Copy link
Copy Markdown
Contributor Author

Following up: I filed https://github.com/sourcegraph/sourcegraph/issues/59910 so we don't forget to dig into this. I wasn't able to confirm the error-handling logic is actually still working.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants