Skip to content

Improve performance of List#418

Merged
camdencheek merged 3 commits into
mainfrom
cc/list-perf
Sep 2, 2022
Merged

Improve performance of List#418
camdencheek merged 3 commits into
mainfrom
cc/list-perf

Conversation

@camdencheek

@camdencheek camdencheek commented Aug 31, 2022

Copy link
Copy Markdown
Member

This updates the List method to use the ShardRepoMaxMatchCount option
when it runs a search so that it doesn't need to search each repository
individually and sequentially. The goal here is to improve performance
for Sourcegraph queries like repo:has.file(test.go).

I don't have a large enough number of repos cloned locally to demonstrate
that this is actually faster, but given that this codepath is really only used for
queries like repo:has.file(), and that's currently performing very badly,
this seems pretty low risk. This was essentially the approach we used
before switching to using List(), except we did it client-side. I've verified
that it's not worse for my ~40 local repos.

Slack thread with context here

This updates the List method to use the ShardRepoMaxMatchCount option
when it runs a search so that it doesn't need to search each repository
individually and sequentially. The goal here is to improve performance
for Sourcegraph queries like `repo:has.file(test.go)`.
Comment thread eval.go
Comment on lines -582 to +587
// We need to run a search per repo to decide if it is included.
include = func(rle *RepoListEntry) (bool, error) {
qOneRepo := query.NewAnd(
query.NewRepoSet(rle.Repository.Name),
q)
sr, err := d.Search(ctx, qOneRepo, &SearchOptions{
ShardMaxMatchCount: 1,
TotalMaxMatchCount: 1,
})
if err != nil {
return false, err
}
return len(sr.Files) > 0, nil
sr, err := d.Search(ctx, q, &SearchOptions{
ShardRepoMaxMatchCount: 1,
})
if err != nil {
return nil, err
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of running a search per repo, this changes it to run a single ahead-of-time search that limits the results to a single match per repo. This should be much more efficient for a large number of repos.

@camdencheek camdencheek marked this pull request as ready for review August 31, 2022 22:14
@camdencheek camdencheek requested a review from a team August 31, 2022 22:14

@rvantonder rvantonder left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very~~~ looking forward to seeing this change live. My review is just a stamp, so probably wait on @sourcegraph/search-core to take a look.

Comment thread eval.go Outdated
Comment thread eval.go
Comment thread eval.go
}

foundRepos := make(map[uint32]struct{}, len(sr.Files))
for _, file := range sr.Files {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of creating this intermediate map, can't we somehow go from sr.Files entry straight to the repo metadata?

@camdencheek camdencheek Sep 1, 2022

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, not without changing behavior.

RepoListEntry has fields on it (Stats, IndexMetadata) that are not available on the file match.

MinimalRepoListEntry currently returns a list of all branches. A file match only contains the list of matching branches.

If d.repoListEntry is reliably sorted, we could sort the file matches by the same key and do a linear merge without any additional allocations, but I couldn't find any information on whether d.repoListEntry is sorted.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah relying on sorted order is dangerous. On more thought this code path is fine, given it only runs when the List query isn't the const true. LGTM

@camdencheek camdencheek merged commit d1964a3 into main Sep 2, 2022
@camdencheek camdencheek deleted the cc/list-perf branch September 2, 2022 14:21
nataliechen1 pushed a commit to nataliechen1/zoekt that referenced this pull request Oct 12, 2022
This updates the List method to use the ShardRepoMaxMatchCount option
when it runs a search so that it doesn't need to search each repository
individually and sequentially. The goal here is to improve performance
for Sourcegraph queries like `repo:has.file(test.go)`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants