Skip to content

sort ngrams before looking them up#617

Merged
keegancsmith merged 1 commit into
mainfrom
k/sorted-lookups
Jul 17, 2023
Merged

sort ngrams before looking them up#617
keegancsmith merged 1 commit into
mainfrom
k/sorted-lookups

Conversation

@keegancsmith

Copy link
Copy Markdown
Member

We believe this will improve performance of the btree lookups. We are investigating this to make it faster to rule out a shard (when freq==0). Testing locally on a large corpus we halved the time spent in IO.

Locally Sort shows up in the profiles significantly, but there are two facts mitigating that:

  • Locally my file page cache is primed so IO rarely is going to disk.
  • We likely will implement an IR for Zoekt which will amortize the Sort to once per search rather than once per shard.

Test Plan: go test ./... and performance profiling via via ./cmd/zoekt.

Part of https://github.com/sourcegraph/sourcegraph/issues/54950

Co-authored-by: @stefanhengl

@keegancsmith keegancsmith requested a review from a team July 14, 2023 13:50
We believe this will improve performance of the btree lookups. We are
investigating this to make it faster to rule out a shard (when freq==0).
Testing locally on a large corpus we halved the time spent in IO.

Locally Sort shows up in the profiles significantly, but there are two
facts mitigating that:
- Locally my file page cache is primed so IO rarely is going to disk.
- We likely will implement an IR for Zoekt which will amortize the Sort
  to once per search rather than once per shard.

Test Plan: go test ./... and performance profiling via via ./cmd/zoekt.

Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Comment thread indexdata.go
ngramOffs := splitNGrams([]byte(query.Pattern))
// PERF: Sort to increase the chances adjacent checks are in the same btree
// bucket (which can cause disk IO).
slices.SortFunc(ngramOffs, func(a, b runeNgramOff) bool {

@stefanhengl stefanhengl Jul 17, 2023

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By sorting the ngrams, the slice of frequencies is now not in its natural order but reflects the sorted ngrams too. That means if we have several ngrams with the lowest frequency, before we picked the pair that was furthest apart, but with the sorting we lost that "spatial" property. Intuitively this should be rare?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think I mentioned that on our call. It should be super rare we have the same frequencies. But I can update the code to use the new offset param to tie break in this case just so we have the same behaviour as before.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright doing this is a bit tricky so I tried to motivate why this optimization doesn't make sense. But then I realised the case where it does make sense. If your query string contains the same trigram twice maximing the distance is good for reducing candidate documents. eg a bad string like just AAAAAAAAAAAA...AAA. So I will make this work, the code just isn't as pleasant. I first implemented something generic that was nice to read, but took a big performance hit since go wasn't able to eliminate out of bounds checks.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefanhengl I am gonna merge this as is and follow up with the PR to re-introduce this optimization. That way I can parallize dev and operational work :)

@keegancsmith keegancsmith merged commit 45f608f into main Jul 17, 2023
@keegancsmith keegancsmith deleted the k/sorted-lookups branch July 17, 2023 08:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants