sort ngrams before looking them up#617
Conversation
We believe this will improve performance of the btree lookups. We are investigating this to make it faster to rule out a shard (when freq==0). Testing locally on a large corpus we halved the time spent in IO. Locally Sort shows up in the profiles significantly, but there are two facts mitigating that: - Locally my file page cache is primed so IO rarely is going to disk. - We likely will implement an IR for Zoekt which will amortize the Sort to once per search rather than once per shard. Test Plan: go test ./... and performance profiling via via ./cmd/zoekt. Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
8d457c9 to
8f3532b
Compare
| ngramOffs := splitNGrams([]byte(query.Pattern)) | ||
| // PERF: Sort to increase the chances adjacent checks are in the same btree | ||
| // bucket (which can cause disk IO). | ||
| slices.SortFunc(ngramOffs, func(a, b runeNgramOff) bool { |
There was a problem hiding this comment.
By sorting the ngrams, the slice of frequencies is now not in its natural order but reflects the sorted ngrams too. That means if we have several ngrams with the lowest frequency, before we picked the pair that was furthest apart, but with the sorting we lost that "spatial" property. Intuitively this should be rare?
There was a problem hiding this comment.
yeah I think I mentioned that on our call. It should be super rare we have the same frequencies. But I can update the code to use the new offset param to tie break in this case just so we have the same behaviour as before.
There was a problem hiding this comment.
Alright doing this is a bit tricky so I tried to motivate why this optimization doesn't make sense. But then I realised the case where it does make sense. If your query string contains the same trigram twice maximing the distance is good for reducing candidate documents. eg a bad string like just AAAAAAAAAAAA...AAA. So I will make this work, the code just isn't as pleasant. I first implemented something generic that was nice to read, but took a big performance hit since go wasn't able to eliminate out of bounds checks.
There was a problem hiding this comment.
@stefanhengl I am gonna merge this as is and follow up with the PR to re-introduce this optimization. That way I can parallize dev and operational work :)
We believe this will improve performance of the btree lookups. We are investigating this to make it faster to rule out a shard (when freq==0). Testing locally on a large corpus we halved the time spent in IO.
Locally Sort shows up in the profiles significantly, but there are two facts mitigating that:
Test Plan: go test ./... and performance profiling via via ./cmd/zoekt.
Part of https://github.com/sourcegraph/sourcegraph/issues/54950
Co-authored-by: @stefanhengl