uploads: Use streaming API for ingesting SCIP indexes by varungandhi-src · Pull Request #53828 · sourcegraph/sourcegraph-public-snapshot

varungandhi-src · 2023-06-21T06:10:25Z

This should reduce memory usage significantly, as in the common
case (no two documents share the same relative path), we will
end up processing one document at a time.

I've tested the new code with a 5.7GB Chromium index, and we're
able to process it with even 100MB of memory (at the cost of
increased GC pressure).

We need to iterate over the index twice, first to get all external symbols,
and then to process documents, as document processing requires access
to the external symbols list. This means we need the ability
to seek to the start again. I've implemented that as follows:

- For small indexes, just read the index into a slice.
- For large indexes, save the compressed index to a temporary file on disk, and rely on the GC and page cache to transparently drop pages earlier in the file when under memory pressure. I figured decompressing is cheap enough that it doesn't make sense to have the extra I/O overhead of reading/writing the uncompressed index.

Other changes:

- Documents are no longer sorted by relative path during iteration. The order of iteration is still deterministic, though, as it matches the order of documents in the index.

Questions:

- Should we add instrumentation to see the memory usage at different stages of processing an index?
- How do we add a memory usage test (either here or in the upstream scip bindings)? ([internal Slack discussion](https://sourcegraph.slack.com/archives/C3B3SDBMY/p1687751363185919))

Test plan

- [x] Update existing tests
- [x] Add low memory usage test -- added upstream. bindings/go: Add memory usage test for streaming parser scip-code/scip#181

efritz · 2023-06-21T12:34:50Z

+		},
+		}
+		if err := secondPassVisitor.ParseStreaming(rr); err != nil {
+			// FIXME: How do we handle the error here properly?


Goroutine will have to be in a monitored waitgroup.

Could you point me to some example code? Also, I didn't quite understand why there's a separate goroutine being spawned for package references below. What's up with that?

conc.WaitGroup is a cool thing to use.

As for the flow of this function:

We make four channels that we write to asynchronously and return from this function. We write values into the channels in a way that's symmetric to how we consume them: here we share the knowledge that all documents are written before any of the package metadata (as we need to process all documents for this data to be complete).

The outer goroutine will write to the documents channel, and do some synchronous work to populate some shared data.

Once all documents are written, the documents channel is closed and the inner goroutine is started. This goroutine will send all package data on one of two channels, closing both once the shared map has been completely iterated.

This isn't necessarily the best shape to keep this code in. We don't get so much from the second goroutine sending channels instead of just the constructed map or a pair of slices (the problem is that we only have this data after we've asynchronously produced all documents).

I've replaced the errors here with a logging statement, since we don't expect this to be hit in practice... We can clean up the error handling further when simplifying the control flow.

sourcegraph-bot · 2023-06-26T03:54:48Z

Codenotify: Notifying subscribers in CODENOTIFY files for diff 9fdf5de...5f6b072.

| Notify | File(s) |
| --- | --- |
| @Strum355 | internal/codeintel/uploads/internal/background/processor/BUILD.bazel<br>internal/codeintel/uploads/internal/background/processor/job_worker_handler.go<br>internal/codeintel/uploads/internal/background/processor/scip.go<br>internal/codeintel/uploads/internal/background/processor/scip_test.go<br>internal/codeintel/uploads/shared/types.go |
| @efritz | internal/codeintel/uploads/internal/background/processor/BUILD.bazel<br>internal/codeintel/uploads/internal/background/processor/job_worker_handler.go<br>internal/codeintel/uploads/internal/background/processor/scip.go<br>internal/codeintel/uploads/internal/background/processor/scip_test.go<br>internal/codeintel/uploads/shared/types.go |

(cherry picked from commit b98ca76)

…55160) Backport b98ca76 from #53828. Co-authored-by: Varun Gandhi <varun.gandhi@sourcegraph.com>

daxmc99 · 2023-08-31T16:19:04Z

Related to inc-237

cla-bot Bot added the cla-signed label Jun 21, 2023

varungandhi-src commented Jun 21, 2023

Comment thread enterprise/internal/codeintel/uploads/internal/background/processor/metrics_resetter.go Outdated

varungandhi-src commented Jun 21, 2023

Comment thread enterprise/internal/codeintel/uploads/internal/background/processor/scip.go Outdated

varungandhi-src commented Jun 21, 2023

Comment thread enterprise/internal/codeintel/uploads/internal/background/processor/scip.go Outdated

efritz reviewed Jun 21, 2023

varungandhi-src commented Jun 23, 2023

Comment thread enterprise/internal/codeintel/uploads/internal/background/processor/job_worker_handler.go Outdated

varungandhi-src commented Jun 23, 2023

Comment thread enterprise/internal/codeintel/uploads/internal/background/processor/scip.go Outdated

varungandhi-src commented Jun 23, 2023

Comment thread enterprise/internal/codeintel/uploads/internal/background/processor/scip.go Outdated

varungandhi-src force-pushed the vg/stream-indexes branch from 72d087a to 94b8382 Compare June 26, 2023 02:38

varungandhi-src marked this pull request as ready for review June 26, 2023 03:52

varungandhi-src changed the title ~~WIP: Use streaming API for parsing indexes~~ uploads: Use streaming API for ingesting SCIP indexes Jun 26, 2023

varungandhi-src commented Jun 26, 2023

Comment thread enterprise/internal/codeintel/uploads/internal/background/processor/job_worker_handler.go Outdated

varungandhi-src mentioned this pull request Jun 28, 2023

bindings/go: Add memory usage test for streaming parser scip-code/scip#181

Merged

varungandhi-src force-pushed the vg/stream-indexes branch from 192cb0c to a2ce2bb Compare July 3, 2023 03:39

varungandhi-src added the backport 5.1 label Jul 3, 2023

varungandhi-src requested a review from efritz July 3, 2023 03:49

jhchabran force-pushed the vg/stream-indexes branch from 3de8f2b to f408276 Compare July 3, 2023 07:18

varungandhi-src added 12 commits July 6, 2023 13:38

WIP: Use streaming API for parsing indexes

f94c013

fix: Consistently wrap upload reader in gzip.Reader

6165cc6

Previously, the buffered reader case would attempt to read bytes directly from the gzip file as protobuf. 🤦

tidy: Fix incorrectly changed error string

9f4183e

tidy: Rearrange imports and delete commented code

6eca903

fix: Remove wrong dereference op in test

a62796e

fix: Use pointer receiver for seekToStart since it is mutating.

7413ed9

test: Replace some Errorf with Fatalf calls

40b2874

cleanup: Minor tweaks based on review comments

4ce48ab

chore: bazel configure

298a049

fix: Replace panics with logging

15f7b1a

Add better doc comment for uncompressedSizeLimitBytes

f7910f7

chore: bazel configure

bbf614d

jhchabran and others added 2 commits July 6, 2023 13:40

go: fix a linting error (unparam)

df9304d

Fix incorrect merge conflict resolution

6f99e7d

varungandhi-src force-pushed the vg/stream-indexes branch from f408276 to 6f99e7d Compare July 6, 2023 05:44

varungandhi-src enabled auto-merge (squash) July 6, 2023 05:57