This repository was archived by the owner on Sep 30, 2024. It is now read-only.
[Backport 5.1] uploads: Use streaming API for ingesting SCIP indexes #55160
Merged
This reduces memory usage significantly: in the common
case (no two documents share the same relative path), we
end up processing one document at a time.
I've tested the new code with a 5.7GB Chromium index, and we're
able to process it with as little as 100MB of memory (at the cost of
increased GC pressure).
We need to iterate over the index twice: first to collect all external symbols,
and then to process documents, since document processing requires access
to the external symbols list. This means we need to be able
to seek back to the start. I've implemented that as follows:
- For small indexes, just read the index into a slice.
- For large indexes, save the compressed index to a temporary file
on disk, and rely on the GC and page cache to transparently
drop pages earlier in the file when under memory pressure.
I figured decompression is cheap enough that it doesn't make
sense to pay the extra I/O overhead of reading/writing the
uncompressed index.
Other changes:
- Documents are no longer sorted by relative path during iteration.
The order of iteration is still deterministic, though, as it matches the
order of documents in the index.
(cherry picked from commit b98ca76)
Contributor
varungandhi-src
approved these changes
Jul 21, 2023
Questions:
stages of processing an index?
scip bindings)? (internal Slack discussion)
Test plan
Backport b98ca76 from uploads: Use streaming API for ingesting SCIP indexes #53828