Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Recursive implementation of searching for correct bounds#49969

Merged
pjlast merged 14 commits into
mainfrom
pjlast/github-search-recursive
Mar 27, 2023
Merged

Recursive implementation of searching for correct bounds#49969
pjlast merged 14 commits into
mainfrom
pjlast/github-search-recursive

Conversation

@pjlast

@pjlast pjlast commented Mar 24, 2023

Copy link
Copy Markdown
Contributor

Before this PR
When a GitHub repositoryQuery encounters more than 1000 repositories in a search result, it would start refining the search window by halving the search window for the creation time of the repositories. Then, when there are less than 1000 repositories in the search results, it would return those repositories, adjust the lower bound of the search window, and start search all over again. For a GitHub Starburst sync's first 10 minutes, it would do 134 such refinement queries, and discover ~10,500 repositories.

After this PR
We do the search window refining recursively, splitting a search into two halves each time if a response contains more than 1000 results. This allows us to keep track of the window splits, and start the next search within the bounds of the split.
For example, if the bounds were from 1 January to 30 January, it would be split into two searches between 1 January to 14 January, and 15 January to 30 January. If the first result still returned too many results, it will split into two searches again: 1 January to 6 January, and 7 January to 14 January, until a search returns an acceptable amount of results.
For a GitHub Starburst sync's first 10 minutes, it does 36 such refinement queries, and discovers ~21,000 repositories.

So that's about a 2x speed increase for the first ten minutes of the sync. This does not mean it extrapolates linearly over the rest of the sync, since I don't think the creation dates of GitHub repositories are linearly distributed. But I do think it would improve even further as the sync progresses, since the compound effect of not having to start from scratch every time should add up.

Test plan

I did some local testing, original tests still passing

@cla-bot cla-bot Bot added the cla-signed label Mar 24, 2023
@pjlast pjlast marked this pull request as ready for review March 25, 2023 14:37
@pjlast pjlast requested review from a team March 25, 2023 14:37
@sourcegraph-bot

sourcegraph-bot commented Mar 25, 2023

Copy link
Copy Markdown
Contributor

Codenotify: Notifying subscribers in CODENOTIFY files for diff 826df4f...d00fcc6.

Notify File(s)
@indradhanush internal/repos/github.go
internal/repos/github_test.go
@sashaostrikov internal/repos/github.go
internal/repos/github_test.go

@sashaostrikov sashaostrikov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment thread internal/repos/github.go
}
func (q *repositoryQuery) doRecursively(ctx context.Context, results chan *githubResult) error {
// If we know that the number of repos in this query is greater than the limit, we can immediately split the query
if q.RepoCount.known && q.RepoCount.count > q.Limit && q.Created.To.Sub(q.Created.From) >= 2*time.Second {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 2 seconds an arbitrary value?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well GitHub createdAt stamps are only accurate to 1 second. So if, somehow, 1000 repositories were created at the same second, we wouldn't be able to refine any further and we should stop.

I'll add it to the comment

Comment thread internal/repos/github.go Outdated
Comment on lines 927 to 946

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we move to a constant or config?

Comment thread internal/repos/github.go Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, can there be a case when there is a result within this 1-sec range?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread internal/repos/github_test.go Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@mrnugget mrnugget left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left superficial code comments, trust you all on the algorithm itself.

Please add a changelog entry for this.

Comment thread internal/repos/github.go Outdated
Comment thread internal/repos/github.go Outdated
Comment thread internal/repos/github.go Outdated
Comment thread internal/repos/github.go Outdated
@pjlast pjlast force-pushed the pjlast/github-search-recursive branch from 31c697c to 79f8cce Compare March 27, 2023 11:06
@sourcegraph-bot

sourcegraph-bot commented Mar 27, 2023

Copy link
Copy Markdown
Contributor

Codenotify: Notifying subscribers in OWNERS files for diff 826df4f...d00fcc6.

No notifications.

Comment thread internal/repos/github.go
@pjlast pjlast merged commit d1aeb41 into main Mar 27, 2023
@pjlast pjlast deleted the pjlast/github-search-recursive branch March 27, 2023 13:42
Comment thread internal/repos/github.go
})
if err != nil {
return nil
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm late to the party, but

if err != nil {
    return nil
}

seems unintuitive. Why do we return no error if an error happens? Maybe add a comment explaining this?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants