[Backport 5.1] uploads: Use streaming API for ingesting SCIP indexes by github-actions[bot] · Pull Request #55160 · sourcegraph/sourcegraph-public-snapshot

github-actions · 2023-07-20T14:18:02Z

This should reduce memory usage significantly, as in the common
case (no two documents share the same relative path), we will
end up processing one document at a time.

I've tested the new code with a 5.7GB Chromium index, and we're
able to process it with even 100MB of memory (at the cost of
increased GC pressure).

We need to iterate over the index twice, first to get all external symbols,
and then to process documents, as document processing requires access
to the external symbols list. This means we need the ability
to seek to the start again. I've implemented that as follows:

- For small indexes, just read the index into a slice.
- For large indexes, save the compressed index to a temporary file
  on disk, and rely on the GC and page cache to transparently
  drop pages earlier in the file when under memory pressure.
  I figured decompressing is cheap enough that it doesn't make
  sense to have the extra I/O overhead of reading/writing the
  uncompressed index.

Other changes:

- Documents are no longer sorted by relative path during iteration.
  The order of iteration is still deterministic, though, as it matches the
  order of documents in the index.

Questions:

- Should we add instrumentation to see the memory usage at different
  stages of processing an index?
- How do we add a memory usage test (either here or in the upstream
  scip bindings)? (internal Slack discussion)
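
As one possible answer to the first question, per-stage instrumentation could be as simple as sampling the Go runtime's memory counters at stage boundaries. The sketch below is only an illustration: `memSample` and `sampleMemory` are hypothetical names, the two loops are stand-ins for the real passes, and nothing here reflects instrumentation that exists in the PR.

```go
package main

import (
	"fmt"
	"runtime"
)

// memSample (hypothetical) is a snapshot of Go runtime memory counters.
type memSample struct {
	Stage     string
	HeapAlloc uint64 // bytes of allocated heap objects, including garbage not yet freed
	Sys       uint64 // total bytes obtained from the OS by the Go runtime
	NumGC     uint32 // number of completed GC cycles so far
}

// sampleMemory reads the runtime counters and labels them with a stage name.
// ReadMemStats briefly stops the world, so it is called only at stage boundaries.
func sampleMemory(stage string) memSample {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return memSample{Stage: stage, HeapAlloc: m.HeapAlloc, Sys: m.Sys, NumGC: m.NumGC}
}

func main() {
	samples := []memSample{sampleMemory("start")}

	// Stand-in for the first pass (collecting external symbols).
	symbols := make([][]byte, 0, 1024)
	for i := 0; i < 1024; i++ {
		symbols = append(symbols, make([]byte, 4096))
	}
	samples = append(samples, sampleMemory("after external symbols pass"))

	// Stand-in for the second pass (processing documents one at a time).
	for i := 0; i < 1024; i++ {
		doc := make([]byte, 128<<10)
		doc[0] = byte(i)
	}
	samples = append(samples, sampleMemory("after documents pass"))
	runtime.KeepAlive(symbols)

	for _, s := range samples {
		fmt.Printf("%-28s heap=%d KiB sys=%d KiB gcs=%d\n",
			s.Stage, s.HeapAlloc>>10, s.Sys>>10, s.NumGC)
	}
}
```

The `NumGC` delta between stages gives a rough view of the GC pressure mentioned above when running with a small memory budget.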

Test plan

- Update existing tests
- Add low memory usage test (added upstream: "bindings/go: Add memory usage test for streaming parser", scip-code/scip#181)

Backport b98ca76 from "uploads: Use streaming API for ingesting SCIP indexes" (#53828)

This reduces memory usage significantly, as in the common case (no two documents share the same relative path), we will end up processing one document at a time.

I've tested the new code with a 5.7GB Chromium index, and we're able to process it with even 100MB of memory (at the cost of increased GC pressure).

We need to iterate over the index twice, first to get all external symbols, and then to process documents, as document processing requires access to the external symbols list. This means we need the ability to seek to the start again. I've implemented that as follows:

- For small indexes, just read the index into a slice.
- For large indexes, save the compressed index to a temporary file on disk, and rely on the GC and page cache to transparently drop pages earlier in the file when under memory pressure. I figured decompressing is cheap enough that it doesn't make sense to have the extra I/O overhead of reading/writing the uncompressed index.

Other changes:

- Documents are no longer sorted by relative path during iteration. The order of iteration is still deterministic, though, as it matches the order of documents in the index.

(cherry picked from commit b98ca76)

sourcegraph-bot · 2023-07-20T14:27:59Z

📖 Storybook live preview

varungandhi-src · 2023-07-24T03:38:05Z

Original PR: https://github.com/sourcegraph/sourcegraph/pull/53828

github-actions Bot requested a review from varungandhi-src July 20, 2023 14:18

github-actions Bot added cla-signed backports backported-to-5.1 labels Jul 20, 2023

varungandhi-src approved these changes Jul 21, 2023

varungandhi-src requested a review from coury-clark July 24, 2023 03:30

coury-clark merged commit 4abd904 into 5.1 Jul 24, 2023

coury-clark deleted the backport-53828-to-5.1 branch July 24, 2023 15:26
