Improve performance of List#418
Conversation
This updates the List method to use the ShardRepoMaxMatchCount option when it runs a search so that it doesn't need to search each repository individually and sequentially. The goal here is to improve performance for Sourcegraph queries like `repo:has.file(test.go)`.
| // We need to run a search per repo to decide if it is included. | ||
| include = func(rle *RepoListEntry) (bool, error) { | ||
| qOneRepo := query.NewAnd( | ||
| query.NewRepoSet(rle.Repository.Name), | ||
| q) | ||
| sr, err := d.Search(ctx, qOneRepo, &SearchOptions{ | ||
| ShardMaxMatchCount: 1, | ||
| TotalMaxMatchCount: 1, | ||
| }) | ||
| if err != nil { | ||
| return false, err | ||
| } | ||
| return len(sr.Files) > 0, nil | ||
| sr, err := d.Search(ctx, q, &SearchOptions{ | ||
| ShardRepoMaxMatchCount: 1, | ||
| }) | ||
| if err != nil { | ||
| return nil, err | ||
| } |
There was a problem hiding this comment.
Instead of running a search per repo, this changes it to run a single ahead-of-time search that limits the results to a single match per repo. This should be much more efficient for a large number of repos.
rvantonder
left a comment
There was a problem hiding this comment.
I am very~~~ looking forward to seeing this change live. My review is just a stamp, so probably wait on @sourcegraph/search-core to take a look.
| } | ||
|
|
||
| foundRepos := make(map[uint32]struct{}, len(sr.Files)) | ||
| for _, file := range sr.Files { |
There was a problem hiding this comment.
instead of creating this intermediate map, can't we somehow go from sr.Files entry straight to the repo metadata?
There was a problem hiding this comment.
AFAICT, not without changing behavior.
RepoListEntry has fields on it (Stats, IndexMetadata) that are not available on the file match.
MinimalRepoListEntry currently returns a list of all branches. A file match only contains the list of matching branches.
If d.repoListEntry is reliably sorted, we could sort the file matches by the same key and do a linear merge without any additional allocations, but I couldn't find any information on whether d.repoListEntry is sorted.
There was a problem hiding this comment.
yeah relying on sorted order is dangerous. On more thought this code path is fine, given it only runs when the List query isn't the const true. LGTM
This updates the List method to use the ShardRepoMaxMatchCount option when it runs a search so that it doesn't need to search each repository individually and sequentially. The goal here is to improve performance for Sourcegraph queries like `repo:has.file(test.go)`.
This updates the List method to use the
ShardRepoMaxMatchCountoptionwhen it runs a search so that it doesn't need to search each repository
individually and sequentially. The goal here is to improve performance
for Sourcegraph queries like
repo:has.file(test.go).I don't have a large enough number of repos cloned locally to demonstrate
that this is actually faster, but given that this codepath is really only used for
queries like
repo:has.file(), and that's currently performing very badly,this seems pretty low risk. This was essentially the approach we used
before switching to using
List(), except we did it client-side. I've verifiedthat it's not worse for my ~40 local repos.
Slack thread with context here