Recursive implementation of searching for correct bounds by pjlast · Pull Request #49969 · sourcegraph/sourcegraph-public-snapshot

pjlast · 2023-03-24T16:26:58Z

Before this PR
When a GitHub repositoryQuery encountered more than 1000 repositories in a search result, it would start refining the search by repeatedly halving the window for the repositories' creation time. Once a search returned fewer than 1000 repositories, it would return those repositories, adjust the lower bound of the search window, and start the search all over again. For the first 10 minutes of a GitHub Starburst sync, it would do 134 such refinement queries and discover ~10,500 repositories.

After this PR
We now refine the search window recursively, splitting a search into two halves whenever a response contains more than 1000 results. This lets us keep track of the window splits and start the next search within the bounds of the split.
For example, if the bounds were 1 January to 30 January, the search would be split into two searches: 1 January to 14 January, and 15 January to 30 January. If the first of those searches still returned too many results, it would be split again into 1 January to 6 January and 7 January to 14 January, and so on until a search returns an acceptable number of results.
For a GitHub Starburst sync's first 10 minutes, it does 36 such refinement queries, and discovers ~21,000 repositories.

So that's about a 2x speed increase for the first ten minutes of the sync. This does not necessarily extrapolate linearly over the rest of the sync, since I don't think the creation dates of GitHub repositories are evenly distributed. But I do think it would improve even further as the sync progresses, since the compound effect of not having to start from scratch every time should add up.

Test plan

I did some local testing; the original tests still pass.

sourcegraph-bot · 2023-03-25T14:39:25Z

Codenotify: Notifying subscribers in CODENOTIFY files for diff 826df4f...d00fcc6.

Notify	File(s)
@indradhanush	internal/repos/github.go internal/repos/github_test.go
@sashaostrikov	internal/repos/github.go internal/repos/github_test.go

sashaostrikov

LGTM!

sashaostrikov · 2023-03-27T05:39:08Z

-		}
+func (q *repositoryQuery) doRecursively(ctx context.Context, results chan *githubResult) error {
+	// If we know that the number of repos in this query is greater than the limit, we can immediately split the query
+	if q.RepoCount.known && q.RepoCount.count > q.Limit && q.Created.To.Sub(q.Created.From) >= 2*time.Second {


Is 2 seconds an arbitrary value?

Well, GitHub createdAt timestamps are only accurate to 1 second. So if, somehow, 1000 repositories were created in the same second, we wouldn't be able to refine any further and we should stop.

I'll add it to the comment

erzhtor · 2023-03-27T08:38:01Z

Shall we move to a constant or config?

erzhtor · 2023-03-27T08:44:10Z

Curious, can there be a case when there is a result within this 1-sec range?

Nope, refer to this comment:
https://github.com/sourcegraph/sourcegraph/pull/49969/files#diff-d06cc97b1197326e45b75efefcffb4052ababb34bb85230d5c9c8cf51afb51bfR1020-R1021

GitHub createdAt timestamps are only accurate to 1 second.

erzhtor · 2023-03-27T08:44:59Z

mrnugget

Left superficial code comments, trust you all on the algorithm itself.

Please add a changelog entry for this.

sourcegraph-bot · 2023-03-27T11:08:14Z

Codenotify: Notifying subscribers in OWNERS files for diff 826df4f...d00fcc6.

No notifications.

vdavid · 2023-03-27T15:26:43Z

+	})
+	if err != nil {
+		return nil
+	}


I'm late to the party, but

if err != nil { return nil }

seems unintuitive. Why do we return no error if an error happens? Maybe add a comment explaining this?

cla-bot Bot added the cla-signed label Mar 24, 2023

github-actions Bot added the team/iam label Mar 24, 2023

pjlast marked this pull request as ready for review March 25, 2023 14:37

pjlast requested review from a team March 25, 2023 14:37

sashaostrikov approved these changes Mar 27, 2023

erzhtor approved these changes Mar 27, 2023

mrnugget approved these changes Mar 27, 2023

pjlast added 13 commits March 27, 2023 13:05

Recursive implementation of searching for correct bounds

c5e844b

Clean up recursive approach, and calculate repo counts to avoid unnecessary requests

efc5ff7

Rather reuse the 'check count' query and save a request

37c574a

Add exit condition on the refinement window size

14b6a8a

Original DoSingleRequest only fetched the first page. Do the same

2dc59bf

Explain recursive process with comment

b716768

Strip GitHub date from query

2a80b31

Expand on minimum timestamp comment

13aaa48

If date strings are zero, set to respective minimum and maximums

70d7d3c

Use correct beginning/end-of-day bounds

689ed7c

Use correct imports and error formatting

c60d744

PR comments

3d6aaee

Changelog entry

79f8cce

pjlast force-pushed the pjlast/github-search-recursive branch from 31c697c to 79f8cce Compare March 27, 2023 11:06

mrnugget reviewed Mar 27, 2023

Move date range conditions into newRepositoryQuery

d00fcc6

pjlast merged commit d1aeb41 into main Mar 27, 2023

pjlast deleted the pjlast/github-search-recursive branch March 27, 2023 13:42

vdavid reviewed Mar 27, 2023

6 participants
