Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

uploads: Use streaming API for ingesting SCIP indexes#53828

Merged
varungandhi-src merged 18 commits into
mainfrom
vg/stream-indexes
Jul 20, 2023
Merged

uploads: Use streaming API for ingesting SCIP indexes#53828
varungandhi-src merged 18 commits into
mainfrom
vg/stream-indexes

Conversation

@varungandhi-src

@varungandhi-src varungandhi-src commented Jun 21, 2023

Copy link
Copy Markdown
Contributor

This should reduce memory usage significantly, as in the common
case (no two documents share the same relative path), we will
end up processing one document at a time.

I've tested the new code with a 5.7GB Chromium index, and we're
able to process it with even 100MB of memory (at the cost of
increased GC pressure).

We need to iterate over the index twice, first to get all external symbols,
and then to process documents, as document processing requires access
to the external symbols list. This means we need the ability
to seek to the start again. I've implemented that as follows:

  • For small indexes, just read the index into a slice.
  • For large indexes, save the compressed index to a temporary file
    on disk, and rely on the GC and page cache to transparently
    drop pages earlier in the file when under memory pressure.
    I figured decompressing is cheap enough, that it doesn't make
    sense to have the extra I/O overhead of reading/writing the
    uncompressed index.

Other changes:

  • Documents are no longer sorted by relative path during iteration
    The order of iteration is still deterministic though as it matches the
    order of documents in the index.

Questions:

  • Should we add instrumentation see the memory usage at different
    stages of processing an index?
  • How do we add a memory usage test (either here or in the upstream
    scip bindings)? (internal Slack discussion)

Test plan

@cla-bot cla-bot Bot added the cla-signed label Jun 21, 2023
Comment thread enterprise/internal/codeintel/uploads/internal/background/processor/scip.go Outdated
Comment thread enterprise/internal/codeintel/uploads/internal/background/processor/scip.go Outdated
},
}
if err := secondPassVisitor.ParseStreaming(rr); err != nil {
// FIXME: How do we handle the error here properly?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Goroutine will have to be in a monitored waitgroup.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you point me to some example code? Also, I didn't quite understand why there's a separate goroutine being spawned for package references below. What's up with that?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conc.WaitGroup is a cool thing to use.

As for the flow of this function:

  • We make four channels we asynchronously write to and return from this function. We write values into the channels in a way that's symmetric to how we consume them - here we share knowledge that all documents are written before any of the package metadata (as we need to process all documents for this data to be completete).
  • The outer goroutine will write to the documents channel, and do some synchronous work to populate some shared data.
  • Once all documents are written, the documents channel is closed and the inner goroutine is started. This goroutine will send all package data on one of two channels, closing both once the shared map has been completely iterated.

This isn't necessarily the best shape to keep this code in. We don't get so much from the second goroutine sending channels instead of just the constructed map or a pair of slices (the problem is that we only have this data after we've asynchronously produced all documents).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've replaced the errors here with a logging statement, since we don't expect this to be hit in practice... We can clean up the error handling further when simplifying the control flow.

Comment thread enterprise/internal/codeintel/uploads/internal/background/processor/scip.go Outdated
Comment thread enterprise/internal/codeintel/uploads/internal/background/processor/scip.go Outdated
@varungandhi-src varungandhi-src marked this pull request as ready for review June 26, 2023 03:52
@varungandhi-src varungandhi-src changed the title WIP: Use streaming API for parsing indexes uploads: Use streaming API for ingesting SCIP indexes Jun 26, 2023
@sourcegraph-bot

sourcegraph-bot commented Jun 26, 2023

Copy link
Copy Markdown
Contributor

Codenotify: Notifying subscribers in CODENOTIFY files for diff 9fdf5de...5f6b072.

Notify File(s)
@Strum355 internal/codeintel/uploads/internal/background/processor/BUILD.bazel
internal/codeintel/uploads/internal/background/processor/job_worker_handler.go
internal/codeintel/uploads/internal/background/processor/scip.go
internal/codeintel/uploads/internal/background/processor/scip_test.go
internal/codeintel/uploads/shared/types.go
@efritz internal/codeintel/uploads/internal/background/processor/BUILD.bazel
internal/codeintel/uploads/internal/background/processor/job_worker_handler.go
internal/codeintel/uploads/internal/background/processor/scip.go
internal/codeintel/uploads/internal/background/processor/scip_test.go
internal/codeintel/uploads/shared/types.go

@varungandhi-src varungandhi-src enabled auto-merge (squash) July 6, 2023 05:57
Comment thread internal/codeintel/uploads/internal/background/processor/scip.go
Comment thread internal/codeintel/uploads/internal/background/processor/job_worker_handler.go Outdated
Comment thread internal/codeintel/uploads/internal/background/processor/job_worker_handler.go Outdated
Comment thread internal/codeintel/uploads/internal/background/processor/job_worker_handler.go Outdated
Comment thread internal/codeintel/uploads/internal/background/processor/job_worker_handler.go Outdated
@varungandhi-src varungandhi-src requested a review from efritz July 20, 2023 09:38
@varungandhi-src varungandhi-src merged commit b98ca76 into main Jul 20, 2023
@varungandhi-src varungandhi-src deleted the vg/stream-indexes branch July 20, 2023 14:16
github-actions Bot pushed a commit that referenced this pull request Jul 20, 2023
This reduces memory usage significantly, as in the common
case (no two documents share the same relative path), we will
end up processing one document at a time.

I've tested the new code with a 5.7GB Chromium index, and we're
able to process it with even 100MB of memory (at the cost of
increased GC pressure).

We need to iterate over the index twice, first to get all external symbols,
and then to process documents, as document processing requires access
to the external symbols list. This means we need the ability
to seek to the start again. I've implemented that as follows:

-   For small indexes, just read the index into a slice.
-   For large indexes, save the compressed index to a temporary file
    on disk, and rely on the GC and page cache to transparently
    drop pages earlier in the file when under memory pressure.
    I figured decompressing is cheap enough, that it doesn't make
    sense to have the extra I/O overhead of reading/writing the
    uncompressed index.

Other changes:

-   Documents are no longer sorted by relative path during iteration
    The order of iteration is still deterministic though as it matches the
    order of documents in the index.

(cherry picked from commit b98ca76)
coury-clark pushed a commit that referenced this pull request Jul 24, 2023
…55160)

This should reduce memory usage significantly, as in the common
case (no two documents share the same relative path), we will
end up processing one document at a time.

I've tested the new code with a 5.7GB Chromium index, and we're
able to process it with even 100MB of memory (at the cost of
increased GC pressure).

We need to iterate over the index twice, first to get all external
symbols,
and then to process documents, as document processing requires access
to the external symbols list. This means we need the ability
to seek to the start again. I've implemented that as follows:
- For small indexes, just read the index into a slice.
- For large indexes, save the compressed index to a temporary file
   on disk, and rely on the GC and page cache to transparently
   drop pages earlier in the file when under memory pressure.
   I figured decompressing is cheap enough, that it doesn't make
   sense to have the extra I/O overhead of reading/writing the
   uncompressed index.

Other changes:
- Documents are no longer sorted by relative path during iteration
  The order of iteration is still deterministic though as it matches the
  order of documents in the index.

Questions:
- Should we add instrumentation see the memory usage at different
  stages of processing an index?
- How do we add a memory usage test (either here or in the upstream
scip bindings)? ([internal Slack
discussion](https://sourcegraph.slack.com/archives/C3B3SDBMY/p1687751363185919))

## Test plan

- [x] Update existing tests
- [x] Add low memory usage test -- added upstream.
scip-code/scip#181 <br> Backport
b98ca76 from #53828

Co-authored-by: Varun Gandhi <varun.gandhi@sourcegraph.com>
@daxmc99

daxmc99 commented Aug 31, 2023

Copy link
Copy Markdown
Contributor

Related to inc-237

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants