Improve performance of List by camdencheek · Pull Request #418 · sourcegraph/zoekt

camdencheek · 2022-08-31T20:36:24Z

This updates the List method to use the ShardRepoMaxMatchCount option
when it runs a search so that it doesn't need to search each repository
individually and sequentially. The goal here is to improve performance
for Sourcegraph queries like repo:has.file(test.go).

I don't have enough repos cloned locally to demonstrate that this is
actually faster. However, this codepath is really only used for queries
like repo:has.file(), which currently perform very badly, so the change
seems pretty low risk. This is essentially the approach we used before
switching to List(), except that we did it client-side. I've verified
that it's not slower for my ~40 local repos.

Slack thread with context here

camdencheek · 2022-08-31T20:40:15Z

-		// We need to run a search per repo to decide if it is included.
-		include = func(rle *RepoListEntry) (bool, error) {
-			qOneRepo := query.NewAnd(
-				query.NewRepoSet(rle.Repository.Name),
-				q)
-			sr, err := d.Search(ctx, qOneRepo, &SearchOptions{
-				ShardMaxMatchCount: 1,
-				TotalMaxMatchCount: 1,
-			})
-			if err != nil {
-				return false, err
-			}
-			return len(sr.Files) > 0, nil
+		sr, err := d.Search(ctx, q, &SearchOptions{
+			ShardRepoMaxMatchCount: 1,
+		})
+		if err != nil {
+			return nil, err
+		}


Instead of running a search per repo, this changes it to run a single ahead-of-time search that limits the results to a single match per repo. This should be much more efficient for a large number of repos.
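
To illustrate the idea, here is a minimal, self-contained sketch of a single pass that caps matches per repository and then collects the set of matching repositories. It does not use zoekt's real types or API: `FileMatch`, `searchCappedPerRepo`, and `reposWithMatch` are hypothetical names invented for this example, and the per-repo cap only stands in for what `ShardRepoMaxMatchCount: 1` is described as doing above.

```go
package main

// FileMatch is a simplified, hypothetical stand-in for a search result.
// It is not zoekt's FileMatch type.
type FileMatch struct {
	Repository string
	FileName   string
}

// searchCappedPerRepo scans all candidate files once and keeps at most
// perRepoMax matches for each repository. This mimics, under the
// assumptions stated above, a single search with a per-repo match limit
// instead of one search per repository.
func searchCappedPerRepo(files []FileMatch, match func(FileMatch) bool, perRepoMax int) []FileMatch {
	counts := make(map[string]int)
	var out []FileMatch
	for _, f := range files {
		if !match(f) {
			continue
		}
		if counts[f.Repository] >= perRepoMax {
			// This repository already reached its cap; skip the extra match.
			continue
		}
		counts[f.Repository]++
		out = append(out, f)
	}
	return out
}

// reposWithMatch collects the names of repositories that have at least
// one result, so a later step can filter a repository list against it.
func reposWithMatch(results []FileMatch) map[string]struct{} {
	found := make(map[string]struct{}, len(results))
	for _, f := range results {
		found[f.Repository] = struct{}{}
	}
	return found
}
```

With a cap of 1, the result set grows with the number of matching repositories rather than the number of matching files, which is all a query like `repo:has.file(test.go)` needs.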

rvantonder

I am very much looking forward to seeing this change go live. My review is just a stamp, so you should probably wait for @sourcegraph/search-core to take a look.

keegancsmith · 2022-09-01T15:48:59Z

+		}
+
+		foundRepos := make(map[uint32]struct{}, len(sr.Files))
+		for _, file := range sr.Files {


Instead of creating this intermediate map, can't we somehow go straight from each sr.Files entry to the repo metadata?

AFAICT, not without changing behavior.

RepoListEntry has fields on it (Stats, IndexMetadata) that are not available on the file match.

MinimalRepoListEntry currently returns a list of all branches. A file match only contains the list of matching branches.

If d.repoListEntry is reliably sorted, we could sort the file matches by the same key and do a linear merge without any additional allocations, but I couldn't find any information on whether d.repoListEntry is sorted.

Yeah, relying on sorted order is dangerous. On further thought, this code path is fine, given that it only runs when the List query isn't the constant true query. LGTM
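
For readers following the thread, the trade-off between the intermediate map and the linear merge can be sketched as below. This is an illustration only, not the code in this PR: `RepoEntry`, `filterByMap`, and `filterByMerge` are hypothetical names, and the entries are keyed by repository name on the assumption that this mirrors the later "update map to use repository name" commit.

```go
package main

import "sort"

// RepoEntry is a simplified, hypothetical stand-in for a repo list entry.
// The real entry carries more metadata (stats, index metadata, branches).
type RepoEntry struct {
	Name     string
	Branches []string
}

// filterByMap keeps the entries whose name appears in matchedRepos.
// It allocates a set, but it is correct for any ordering of entries.
func filterByMap(entries []RepoEntry, matchedRepos []string) []RepoEntry {
	found := make(map[string]struct{}, len(matchedRepos))
	for _, name := range matchedRepos {
		found[name] = struct{}{}
	}
	var out []RepoEntry
	for _, e := range entries {
		if _, ok := found[e.Name]; ok {
			out = append(out, e)
		}
	}
	return out
}

// filterByMerge walks entries and the sorted match names in lockstep.
// It is only correct if entries is already sorted by Name; if that
// assumption is violated it silently drops entries. The names are copied
// here so the caller's slice is not reordered; sorting in place would
// avoid that allocation.
func filterByMerge(entries []RepoEntry, matchedRepos []string) []RepoEntry {
	names := append([]string(nil), matchedRepos...)
	sort.Strings(names)
	var out []RepoEntry
	i := 0
	for _, e := range entries {
		// Skip match names that sort before the current entry.
		for i < len(names) && names[i] < e.Name {
			i++
		}
		if i < len(names) && names[i] == e.Name {
			out = append(out, e)
		}
	}
	return out
}
```

The merge variant gives the same answer as the map only when the entry list is sorted by the same key, which is the guarantee nobody could confirm in the discussion above.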

camdencheek added 2 commits August 31, 2022 14:34

minimize diff

4a6dab8

camdencheek marked this pull request as ready for review August 31, 2022 22:14

camdencheek requested a review from a team August 31, 2022 22:14

rvantonder approved these changes Sep 1, 2022

keegancsmith requested changes Sep 1, 2022

update map to use repository name

c82c97e

camdencheek requested a review from keegancsmith September 2, 2022 13:56

keegancsmith approved these changes Sep 2, 2022

camdencheek merged commit d1964a3 into main Sep 2, 2022

camdencheek deleted the cc/list-perf branch September 2, 2022 14:21

camdencheek mentioned this pull request Sep 2, 2022

gomod: update zoekt for improved List performance sourcegraph/sourcegraph-public-snapshot#41263

Merged

camdencheek merged 3 commits into main from cc/list-perf
