Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

[Backport 5.1] uploads: Use streaming API for ingesting SCIP indexes#55160

Merged
coury-clark merged 1 commit into
5.1from
backport-53828-to-5.1
Jul 24, 2023
Merged

[Backport 5.1] uploads: Use streaming API for ingesting SCIP indexes#55160
coury-clark merged 1 commit into
5.1from
backport-53828-to-5.1

Conversation

@github-actions

Copy link
Copy Markdown
Contributor

This should reduce memory usage significantly, as in the common
case (no two documents share the same relative path), we will
end up processing one document at a time.

I've tested the new code with a 5.7GB Chromium index, and we're
able to process it with even 100MB of memory (at the cost of
increased GC pressure).

We need to iterate over the index twice, first to get all external symbols,
and then to process documents, as document processing requires access
to the external symbols list. This means we need the ability
to seek to the start again. I've implemented that as follows:

  • For small indexes, just read the index into a slice.
  • For large indexes, save the compressed index to a temporary file
    on disk, and rely on the GC and page cache to transparently
    drop pages earlier in the file when under memory pressure.
    I figured decompressing is cheap enough, that it doesn't make
    sense to have the extra I/O overhead of reading/writing the
    uncompressed index.

Other changes:

  • Documents are no longer sorted by relative path during iteration
    The order of iteration is still deterministic though as it matches the
    order of documents in the index.

Questions:

  • Should we add instrumentation see the memory usage at different
    stages of processing an index?
  • How do we add a memory usage test (either here or in the upstream
    scip bindings)? (internal Slack discussion)

Test plan

This reduces memory usage significantly, as in the common
case (no two documents share the same relative path), we will
end up processing one document at a time.

I've tested the new code with a 5.7GB Chromium index, and we're
able to process it with even 100MB of memory (at the cost of
increased GC pressure).

We need to iterate over the index twice, first to get all external symbols,
and then to process documents, as document processing requires access
to the external symbols list. This means we need the ability
to seek to the start again. I've implemented that as follows:

-   For small indexes, just read the index into a slice.
-   For large indexes, save the compressed index to a temporary file
    on disk, and rely on the GC and page cache to transparently
    drop pages earlier in the file when under memory pressure.
    I figured decompressing is cheap enough, that it doesn't make
    sense to have the extra I/O overhead of reading/writing the
    uncompressed index.

Other changes:

-   Documents are no longer sorted by relative path during iteration
    The order of iteration is still deterministic though as it matches the
    order of documents in the index.

(cherry picked from commit b98ca76)
@sourcegraph-bot

Copy link
Copy Markdown
Contributor

📖 Storybook live preview

@varungandhi-src

Copy link
Copy Markdown
Contributor

@coury-clark coury-clark merged commit 4abd904 into 5.1 Jul 24, 2023
@coury-clark coury-clark deleted the backport-53828-to-5.1 branch July 24, 2023 15:26
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants