Recursive implementation of searching for correct bounds#49969
Conversation
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff 826df4f...d00fcc6.
|
| } | ||
| func (q *repositoryQuery) doRecursively(ctx context.Context, results chan *githubResult) error { | ||
| // If we know that the number of repos in this query is greater than the limit, we can immediately split the query | ||
| if q.RepoCount.known && q.RepoCount.count > q.Limit && q.Created.To.Sub(q.Created.From) >= 2*time.Second { |
There was a problem hiding this comment.
Is 2 seconds an arbitrary value?
There was a problem hiding this comment.
Well GitHub createdAt stamps are only accurate to 1 second. So if, somehow, 1000 repositories were created at the same second, we wouldn't be able to refine any further and we should stop.
I'll add it to the comment
There was a problem hiding this comment.
Shall we move to a constant or config?
There was a problem hiding this comment.
Curious, can there be a case when there is a result within this 1-sec range?
There was a problem hiding this comment.
Nope, refer to this comment:
https://github.com/sourcegraph/sourcegraph/pull/49969/files#diff-d06cc97b1197326e45b75efefcffb4052ababb34bb85230d5c9c8cf51afb51bfR1020-R1021
GitHub createdAt timestamps are only accurate to 1 second.
mrnugget
left a comment
There was a problem hiding this comment.
Left superficial code comments, trust you all on the algorithm itself.
Please add a changelog entry for this.
31c697c to
79f8cce
Compare
|
Codenotify: Notifying subscribers in OWNERS files for diff 826df4f...d00fcc6. No notifications. |
| }) | ||
| if err != nil { | ||
| return nil | ||
| } |
There was a problem hiding this comment.
I'm late to the party, but
if err != nil {
return nil
}
seems unintuitive. Why do we return no error if an error happens? Maybe add a comment explaining this?
Before this PR
When a GitHub
repositoryQueryencounters more than 1000 repositories in a search result, it would start refining the search window by halving the search window for the creation time of the repositories. Then, when there are less than 1000 repositories in the search results, it would return those repositories, adjust the lower bound of the search window, and start search all over again. For a GitHub Starburst sync's first 10 minutes, it would do 134 such refinement queries, and discover ~10,500 repositories.After this PR
We do the search window refining recursively, splitting a search into two halves each time if a response contains more than 1000 results. This allows us to keep track of the window splits, and start the next search within the bounds of the split.
For example, if the bounds were from 1 January to 30 January, it would be split into two searches between 1 January to 14 January, and 15 January to 30 January. If the first result still returned too many results, it will split into two searches again: 1 January to 6 January, and 7 January to 14 January, until a search returns an acceptable amount of results.
For a GitHub Starburst sync's first 10 minutes, it does 36 such refinement queries, and discovers ~21,000 repositories.
So that's about a 2x speed increase for the first ten minutes of the sync. This does not mean it extrapolates linearly over the rest of the sync, since I don't think the creation dates of GitHub repositories are linearly distributed. But I do think it would improve even further as the sync progresses, since the compound effect of not having to start from scratch every time should add up.
Test plan
I did some local testing, original tests still passing