uploads: Use streaming API for ingesting SCIP indexes#53828
Conversation
| }, | ||
| } | ||
| if err := secondPassVisitor.ParseStreaming(rr); err != nil { | ||
| // FIXME: How do we handle the error here properly? |
There was a problem hiding this comment.
Goroutine will have to be in a monitored waitgroup.
There was a problem hiding this comment.
Could you point me to some example code? Also, I didn't quite understand why there's a separate goroutine being spawned for package references below. What's up with that?
There was a problem hiding this comment.
conc.WaitGroup is a cool thing to use.
As for the flow of this function:
- We make four channels we asynchronously write to and return from this function. We write values into the channels in a way that's symmetric to how we consume them - here we share knowledge that all documents are written before any of the package metadata (as we need to process all documents for this data to be completete).
- The outer goroutine will write to the
documentschannel, and do some synchronous work to populate some shared data. - Once all documents are written, the documents channel is closed and the inner goroutine is started. This goroutine will send all package data on one of two channels, closing both once the shared map has been completely iterated.
This isn't necessarily the best shape to keep this code in. We don't get so much from the second goroutine sending channels instead of just the constructed map or a pair of slices (the problem is that we only have this data after we've asynchronously produced all documents).
There was a problem hiding this comment.
I've replaced the errors here with a logging statement, since we don't expect this to be hit in practice... We can clean up the error handling further when simplifying the control flow.
72d087a to
94b8382
Compare
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff 9fdf5de...5f6b072.
|
192cb0c to
a2ce2bb
Compare
3de8f2b to
f408276
Compare
Previously, the buffered reader case would attempt to read bytes directly from the gzip file as protobuf. 🤦
f408276 to
6f99e7d
Compare
This reduces memory usage significantly, as in the common
case (no two documents share the same relative path), we will
end up processing one document at a time.
I've tested the new code with a 5.7GB Chromium index, and we're
able to process it with even 100MB of memory (at the cost of
increased GC pressure).
We need to iterate over the index twice, first to get all external symbols,
and then to process documents, as document processing requires access
to the external symbols list. This means we need the ability
to seek to the start again. I've implemented that as follows:
- For small indexes, just read the index into a slice.
- For large indexes, save the compressed index to a temporary file
on disk, and rely on the GC and page cache to transparently
drop pages earlier in the file when under memory pressure.
I figured decompressing is cheap enough, that it doesn't make
sense to have the extra I/O overhead of reading/writing the
uncompressed index.
Other changes:
- Documents are no longer sorted by relative path during iteration
The order of iteration is still deterministic though as it matches the
order of documents in the index.
(cherry picked from commit b98ca76)
…55160) This should reduce memory usage significantly, as in the common case (no two documents share the same relative path), we will end up processing one document at a time. I've tested the new code with a 5.7GB Chromium index, and we're able to process it with even 100MB of memory (at the cost of increased GC pressure). We need to iterate over the index twice, first to get all external symbols, and then to process documents, as document processing requires access to the external symbols list. This means we need the ability to seek to the start again. I've implemented that as follows: - For small indexes, just read the index into a slice. - For large indexes, save the compressed index to a temporary file on disk, and rely on the GC and page cache to transparently drop pages earlier in the file when under memory pressure. I figured decompressing is cheap enough, that it doesn't make sense to have the extra I/O overhead of reading/writing the uncompressed index. Other changes: - Documents are no longer sorted by relative path during iteration The order of iteration is still deterministic though as it matches the order of documents in the index. Questions: - Should we add instrumentation see the memory usage at different stages of processing an index? - How do we add a memory usage test (either here or in the upstream scip bindings)? ([internal Slack discussion](https://sourcegraph.slack.com/archives/C3B3SDBMY/p1687751363185919)) ## Test plan - [x] Update existing tests - [x] Add low memory usage test -- added upstream. scip-code/scip#181 <br> Backport b98ca76 from #53828 Co-authored-by: Varun Gandhi <varun.gandhi@sourcegraph.com>
|
Related to inc-237 |
This should reduce memory usage significantly, as in the common
case (no two documents share the same relative path), we will
end up processing one document at a time.
I've tested the new code with a 5.7GB Chromium index, and we're
able to process it with even 100MB of memory (at the cost of
increased GC pressure).
We need to iterate over the index twice, first to get all external symbols,
and then to process documents, as document processing requires access
to the external symbols list. This means we need the ability
to seek to the start again. I've implemented that as follows:
on disk, and rely on the GC and page cache to transparently
drop pages earlier in the file when under memory pressure.
I figured decompressing is cheap enough, that it doesn't make
sense to have the extra I/O overhead of reading/writing the
uncompressed index.
Other changes:
The order of iteration is still deterministic though as it matches the
order of documents in the index.
Questions:
stages of processing an index?
scip bindings)? (internal Slack discussion)
Test plan