sort ngrams before looking them up by keegancsmith · Pull Request #617 · sourcegraph/zoekt

keegancsmith · 2023-07-14T13:50:41Z

We believe this will improve performance of the btree lookups. We are investigating this to make it faster to rule out a shard (when freq==0). Testing locally on a large corpus we halved the time spent in IO.

Locally Sort shows up in the profiles significantly, but there are two facts mitigating that:

- Locally my file page cache is primed, so IO rarely goes to disk.
- We will likely implement an IR for Zoekt, which will amortize the Sort to once per search rather than once per shard.

Test Plan: go test ./... and performance profiling via ./cmd/zoekt.

Part of https://github.com/sourcegraph/sourcegraph/issues/54950

Co-authored-by: @stefanhengl
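For readers without the diff open, here is a minimal sketch of the idea described above: visit the ngrams in sorted order so adjacent lookups are more likely to land in the same btree bucket, and bail out as soon as one has freq==0. This is not the code in this PR. The `ngram` type, `lookupSorted`, and the `lookup` callback are hypothetical names for illustration only, and unlike the merged change this sketch hands the frequencies back in query order.

```go
package main

import "sort"

// ngram is a hypothetical stand-in for an ngram key in the index.
type ngram uint64

// lookupSorted calls lookup once per ngram, visiting the ngrams in ascending
// key order so that consecutive lookups tend to touch neighbouring buckets.
// It returns ok=false as soon as any ngram has frequency 0, since the shard
// can then be ruled out. On success, freqs[i] is the frequency of query[i].
func lookupSorted(query []ngram, lookup func(ngram) uint32) (freqs []uint32, ok bool) {
	// order holds indexes into query, sorted by the ngram they point at.
	order := make([]int, len(query))
	for i := range order {
		order[i] = i
	}
	sort.Slice(order, func(a, b int) bool {
		return query[order[a]] < query[order[b]]
	})

	freqs = make([]uint32, len(query))
	for _, i := range order {
		f := lookup(query[i])
		if f == 0 {
			return nil, false // ngram absent: no document in this shard can match
		}
		freqs[i] = f
	}
	return freqs, true
}
```

Sorting an index slice rather than the ngrams themselves is one way to keep the original positions around; it costs an extra allocation per call, which is the kind of overhead the Sort note above refers to.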


stefanhengl · 2023-07-17T07:53:29Z

 	ngramOffs := splitNGrams([]byte(query.Pattern))
+	// PERF: Sort to increase the chances adjacent checks are in the same btree
+	// bucket (which can cause disk IO).
+	slices.SortFunc(ngramOffs, func(a, b runeNgramOff) bool {


By sorting the ngrams, the slice of frequencies is no longer in its natural order; it follows the sorted ngrams instead. Previously, if several ngrams shared the lowest frequency, we picked the pair that was furthest apart. With the sorting we lose that "spatial" property. Intuitively this should be rare?

Yeah, I think I mentioned that on our call. It should be super rare that we have the same frequencies. But I can update the code to use the new offset param to tie-break in this case, just so we have the same behaviour as before.

Alright, doing this is a bit tricky, so I tried to motivate why this optimization doesn't make sense. But then I realised the case where it does make sense: if your query string contains the same trigram twice, maximizing the distance is good for reducing candidate documents, e.g. a bad string like just AAAAAAAAAAAA...AAA. So I will make this work, the code just isn't as pleasant. I first implemented something generic that was nice to read, but it took a big performance hit since Go wasn't able to eliminate out-of-bounds checks.

@stefanhengl I am gonna merge this as is and follow up with the PR to re-introduce this optimization. That way I can parallelize dev and operational work :)
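To make the tie-break discussed in this thread concrete, here is a rough sketch of choosing the two rarest ngrams while preferring the widest pair among equal frequencies. It is an illustration only, not the follow-up PR's implementation: `candidate`, `pickPair`, and `abs` are hypothetical names, and it favours readability over the bounds-check-free hot path mentioned above.

```go
package main

import "sort"

// candidate is a hypothetical pairing of an ngram's position in the query
// with its frequency in the shard.
type candidate struct {
	offset int
	freq   uint32
}

func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

// pickPair returns the offsets (first <= last) of the two rarest candidates.
// When frequencies tie, it picks the candidates that are furthest apart.
// ok is false if fewer than two candidates are given.
func pickPair(cs []candidate) (first, last int, ok bool) {
	if len(cs) < 2 {
		return 0, 0, false
	}
	// Work on a copy sorted by (freq, offset) so the input stays untouched.
	sorted := append([]candidate(nil), cs...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].freq != sorted[j].freq {
			return sorted[i].freq < sorted[j].freq
		}
		return sorted[i].offset < sorted[j].offset
	})

	// n is the number of candidates sharing the lowest frequency.
	n := 1
	for n < len(sorted) && sorted[n].freq == sorted[0].freq {
		n++
	}
	if n >= 2 {
		// Several equally rare ngrams: take the outermost two.
		return sorted[0].offset, sorted[n-1].offset, true
	}

	// A single rarest ngram: among the second-rarest, take the one furthest from it.
	a := sorted[0].offset
	best := sorted[1]
	for m := 2; m < len(sorted) && sorted[m].freq == sorted[1].freq; m++ {
		if abs(sorted[m].offset-a) > abs(best.offset-a) {
			best = sorted[m]
		}
	}
	if best.offset < a {
		return best.offset, a, true
	}
	return a, best.offset, true
}
```

For the AAAA...AAA case every trigram has the same frequency, so this sketch returns the first and last offsets, which is the "maximise distance" behaviour the thread wants to preserve.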

keegancsmith requested a review from a team July 14, 2023 13:50

keegancsmith force-pushed the k/sorted-lookups branch from 8d457c9 to 8f3532b Compare July 14, 2023 13:54

stefanhengl approved these changes Jul 17, 2023

keegancsmith merged commit 45f608f into main Jul 17, 2023

keegancsmith deleted the k/sorted-lookups branch July 17, 2023 08:32

keegancsmith mentioned this pull request Jul 17, 2023

maximise distance between ngrams #618

Merged
