feat(code insights): language stats speed improvements by using archive loading by bahrmichael · Pull Request #62946 · sourcegraph/sourcegraph-public-snapshot

bahrmichael · 2024-05-28T14:23:54Z

We previously improved the performance of Language Stats Insights by introducing parallel requests to gitserver: https://github.com/sourcegraph/sourcegraph/pull/62011

This PR replaces the previous approach where we would iterate through and request each file from gitserver with an approach where we request just one archive. This eliminates a lot of network traffic, and gives us an additional(!) performance improvement of 70-90%.

Even repositories like chromium (42GB) can now be processed (on my machine in just one minute).
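To illustrate the single-archive idea, here is a minimal sketch that walks a tar stream once and sums bytes per file extension. It is not the code from this PR: `bytesByExt` and `buildArchive` are hypothetical helpers, the archive is built in memory instead of being fetched from gitserver, and grouping by extension is only a stand-in for real language detection.

```go
package main

import (
	"archive/tar"
	"bytes"
	"errors"
	"fmt"
	"io"
	"path"
)

// bytesByExt walks a tar stream once and sums file sizes per extension.
// Grouping by extension is a simplification of real language detection.
func bytesByExt(r io.Reader) (map[string]int64, error) {
	totals := make(map[string]int64)
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			return totals, nil // end of archive
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // skip directories, symlinks, and other non-file entries
		}
		// The size comes from the header, so the file body is never read.
		totals[path.Ext(hdr.Name)] += hdr.Size
	}
}

// buildArchive creates a small in-memory tar archive for the demo.
func buildArchive(files map[string]string) (*bytes.Buffer, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for name, body := range files {
		hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(body)), Typeflag: tar.TypeReg}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write([]byte(body)); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return &buf, nil
}

func main() {
	buf, err := buildArchive(map[string]string{
		"main.go":     "package main\n",
		"lib/util.go": "package lib\n",
		"README.md":   "# demo\n",
	})
	if err != nil {
		panic(err)
	}
	totals, err := bytesByExt(buf)
	if err != nil {
		panic(err)
	}
	fmt.Println(totals)
}
```

The point of the sketch is that one sequential read replaces one request per file, which is where the network savings described above come from.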

Caching: We dropped most of the caching, and kept only the top-level caching (repo@commit). This means that we only need to compute the language stats once per commit, and subsequent users/requests can see the cached data. We dropped the file/directory level caching, because (1) the code to do that got very complex and (2) we can assume that most repositories are able to compute within the 5-minute timeout (which can be increased via the environment variable GET_INVENTORY_TIMEOUT). The timeout is not bound to the user's request anymore. Before, the frontend would request the stats up to three times to let the computation move forward and pick up where the last request aborted. While we still have this frontend retry mechanism, we don't have to worry about an abort-and-continue mechanism in the backend.

Credits for the code to @eseliger: https://github.com/sourcegraph/sourcegraph/issues/62019#issuecomment-2119278481

I've taken the diff, and updated the caching methods to allow for more advanced use cases should we decide to introduce more caching. We can take that out again if the current caching is sufficient.

Todos:

Check if CI passes, manual testing seems to be fine
Verify that insights are cached at the top level

Test data:

sourcegraph/sourcegraph: 9.07s (main) -> 1.44s (current): 74% better
facebook/react: 17.52s (main) -> 0.87s (current): 95% better
godotengine/godot: 28.92s (main) -> 1.98s (current): 93% better
chromium/chromium: ~1 minute: 100% better, because it didn't compute before

Changelog

Language stats queries now request one archive from gitserver instead of individual file requests. This leads to a huge performance improvement. Even extra large repositories like chromium are now able to compute within one minute. Previously they timed out.

Test plan

New unit tests
Plenty of manual testing

bahrmichael · 2024-05-29T14:31:06Z

Currently two problems, the pax_global_header is probably from my changes in https://github.com/sourcegraph/sourcegraph/pull/62946/commits/ec96924a738785a1316e83d27bac8d64797f9f08. Seems like we're missing e2e tests here.

eseliger · 2024-05-30T14:05:36Z

for the pax_global_header you might be able to get around that issue by adding a fd.IsRegular() check. It's a magical file that is unfortunately added to tar archives :|

bahrmichael · 2024-05-30T14:10:21Z

for the pax_global_header you might be able to get around that issue by adding a fd.IsRegular() check. It's a magical file that is unfortunately added to tar archives :|

Thank you for the idea! I found that my changes to the reader seem to have caused this, which I reverted in https://github.com/sourcegraph/sourcegraph/pull/62946/commits/fb2536ec3cb74a5c4ab6be7e55c08a0c18bf1066.

I'll double check the impact of pax_global_header so we don't report false numbers.

bahrmichael · 2024-05-30T14:23:50Z

The pax_global_header is not counting towards the language stats, because it's skipped at https://github.com/sourcegraph/sourcegraph/blob/0e2a668954f2f8169da51224902865686fa8e472/cmd/frontend/internal/inventory/inventory.go#L89

eseliger · 2024-05-30T14:25:17Z

great!

eseliger · 2024-05-31T18:15:05Z

+	if rc == nil || skipEnhancedLanguageDetection {
 		lang.Name = matchedLang
 		lang.TotalBytes = uint64(file.Size())
 		return lang, nil


could we save the call to getFileReader here at all, if skipEnhancedLanguageDetection?

I think we can drop the nil check and move the getFileReader call downwards. Previously nil indicated that the language check should be skipped, but we now have that as a bool.

Previous nil init (see deleted code in this PR):

if !useEnhancedLanguageDetection && !forceEnhancedLanguageDetection {
	// If USE_ENHANCED_LANGUAGE_DETECTION is disabled, do not read file contents to determine
	// the language. This means we won't calculate the number of lines per language.
	invCtx.NewFileReader = func(ctx context.Context, path string) (io.ReadCloser, error) {
		return nil, nil
	}
}

New init of according bool:

ShouldSkipEnhancedLanguageDetection: !useEnhancedLanguageDetection && !forceEnhancedLanguageDetection,

eseliger · 2024-05-31T18:19:03Z

+		case entry.Mode().IsDir():
+			// If we want to we could try to optimize cache invalidation at the tree level
+			// here. For now, we only iterate over all the files in the archive.
+			continue


I'm not sure if we had that before, but this is the code path that would allow for incremental indexing. We can leave that out for now, but should monitor the latency and load this causes for large repos (i.e., let's check on large cloud customers after the release).

At the end of May I gave it a try to cache based on the directory, and found that it's a bit harder than before. This is mainly because now we get a flat list of records, files and directories in them.

I don't recall exactly if the directory came before the files or after, and am not sure if the order of files and directories is guaranteed. Can you confirm that @eseliger?

I had two ideas: (1) Remember the directory, process files, and when a new directory appears, cache the above. Discarded that again because it felt too hacky. (2) Load everything into a tree structure and process like before, but then we'll need to load everything and actually lose the benefit directory level caching would bring us.

I don't recall exactly if the directory came before the files or after, and am not sure if the order of files and directories is guaranteed. Can you confirm that @eseliger?

I think the order should be guaranteed within one commit, but might differ between commits. No 100% guarantee - I'd need to check in the git code to be sure. How important is that? I don't think that's documented behavior of the API today, so would be something implicit we rely on.

Within one commit is fine. I tried to use the folder as an open or close marker to its files. To determine when all files in a folder have been read and caching that before the whole process completes, the guaranteed order is essential.

Here's what I recall seeing:

...
folder1
folder1/a
folder2/a
folder2/b
folder2/c
folder2
folder3/a
...

If the enclosing folder always comes before or after the files, we can use the leading or trailing folder to determine when all files have been read. That's all the guarantees I'm looking for here. Either before or after, but always right before or right after all of its files.

This is starting to sound like a leetcode problem 🙂

It looks like git will always print the directory node first, then go into that tree, and so forth (depth first).

So it would be

folder (d)
folder/file
folder/nested (d)
folder/nested/file
folder/otherfile
rootfile
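If that ordering holds (each directory entry right before its contents, depth first), detecting when a directory is complete can be done with a small stack. The sketch below is only an illustration of that idea under that assumption; `entry` and `closedDirs` are hypothetical names, and this is not the directory-level caching that was later tried and dropped in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// entry is a simplified archive record: a path plus a directory flag.
type entry struct {
	path  string
	isDir bool
}

// closedDirs returns directories in the order they are known to be complete,
// assuming every directory entry precedes its children (depth-first order).
func closedDirs(entries []entry) []string {
	var open, closed []string // open is a stack of directories still being read

	// closeUntil pops every open directory that does not contain next.
	closeUntil := func(next string) {
		for len(open) > 0 && !strings.HasPrefix(next, open[len(open)-1]+"/") {
			closed = append(closed, open[len(open)-1])
			open = open[:len(open)-1]
		}
	}

	for _, e := range entries {
		closeUntil(e.path)
		if e.isDir {
			open = append(open, strings.TrimSuffix(e.path, "/"))
		}
	}
	closeUntil("") // the end of the archive closes whatever is still open
	return closed
}

func main() {
	entries := []entry{
		{"folder", true},
		{"folder/file", false},
		{"folder/nested", true},
		{"folder/nested/file", false},
		{"folder/otherfile", false},
		{"rootfile", false},
	}
	fmt.Println(closedDirs(entries)) // prints [folder/nested folder]
}
```

A directory is reported as closed as soon as the first entry outside of it appears, which is the point where a per-directory result could be cached.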

eseliger

A few last comments and Camden should also review, but from a Source perspective, this seems fine

eseliger · 2024-07-01T21:41:35Z

+		case entry.Mode().IsDir():
+			// If we want to we could try to optimize cache invalidation at the tree level
+			// here. For now, we only iterate over all the files in the archive.
+			continue


It looks like git will always print the directory node first, then go into that tree, and so forth (depth first).

So it would be

folder (d)
folder/file
folder/nested (d)
folder/nested/file
folder/otherfile
rootfile

bahrmichael · 2024-07-09T08:26:04Z

FYI: I'm currently focussing on the batch changes work with Bolaji, and will continue to work on this afterwards. Probably by the end of the week.

bahrmichael · 2024-07-16T11:46:59Z

@eseliger @camdencheek I completed another pass, and the PR is ready for you to review again. The major change is that I reworked the inventory code to be simpler to test, then added some complicated caching logic, and with @camdencheek decided to drop that caching again. This means the code is now just a bit simpler to test, and doesn't cache except at the repo-level. Let me know what you think!

camdencheek

Almost there! Just a couple of quick things before I think this is ready to go

camdencheek · 2024-07-16T15:04:27Z

+		switch {
+		case entry.Mode().IsRegular():
+			lang, err := getLang(ctx, entry, func(ctx context.Context, path string) (io.ReadCloser, error) {
+				return n.FileReader, nil


(blocking)
When is FileReader closed?

This happens in inventory.go with a defer right after the function to get the file reader is called.

rc, err := getFileReader(ctx, file.Name())
if err != nil {
	return Lang{}, errors.Wrap(err, "Failed to create a file reader.")
}
if rc != nil {
	defer rc.Close()
}

With https://github.com/sourcegraph/sourcegraph/pull/62946/commits/0c60af57a2f9536a910089e04c17b028a27b36c0 and https://github.com/sourcegraph/sourcegraph/pull/62946/commits/f0443e5d19768717b462d0982e5ad930a6e45906 I updated the code so that the file io.NopCloser (n.FileReader) is only created when the getFileReader function is called.

With https://github.com/sourcegraph/sourcegraph/commit/f8c5be275b9ededc64df577dadfa874628052fb3 I added unit tests to make sure that the file reader is closed after getLang accesses the function.

I feel like I have a gap when it comes to understanding how and when io readers are opened and closed, and what the impact of unclosed readers in golang is. I'll try to read up on that, but if there are some dangers that you'd like to call out, please don't hesitate! :)

As far as I can see, it's just the general io problems with leaks when you don't close readers.

In the All function we have an ArchiveReader (which now gets closed with https://github.com/sourcegraph/sourcegraph/pull/62946/commits/2d90cef9cacbc557b32c57d6e32da2ac8fc489b7) that gets handed to various io.NopClosers. I think with the ArchiveReader's closing and the other calls to close, we should be safe.

Is there any automatic mechanism in our stack to close all remaining sockets if the request that opened them is done, i.e. garbage collection of handlers and sockets?

camdencheek

Looks great -- thank you!

feat: use archiving (by @eseliger); update cache methods for more advanced usage

b9063c0

cla-bot Bot added the cla-signed label May 28, 2024

bahrmichael added 2 commits May 28, 2024 16:26

fix: run bazel configure

e457866

fix: revert cache change; fix tests

ec96924

bahrmichael added 3 commits May 30, 2024 12:11

chore: add graphql test

53e6bd6

chore: make the test fail

4f3b5d5

fix: revert some reader changes and rework the enhancedLanguageDetection mechanism

fb2536e

chore: delete generated test

7226dbb

feat: add top level caching

0d28dbb

peterguy requested a review from camdencheek May 30, 2024 17:35

fix: test

169e82a

bahrmichael marked this pull request as ready for review May 31, 2024 07:52

Merge branch 'main' into bahrmichael/further-language-stats-improvements

2ffd173

bahrmichael requested a review from eseliger May 31, 2024 07:56

chore: add changelog

ecca0f8

bahrmichael changed the title ~~feat: language stats speed improvements by using archive loading~~ feat(code insights): language stats speed improvements by using archive loading May 31, 2024

chore: update changelog

d1494a4

eseliger approved these changes May 31, 2024

bahrmichael assigned eseliger and camdencheek Jun 3, 2024

bahrmichael added 5 commits June 25, 2024 13:13

merge from main

fe872b9

chore: fix linting issue

ca0ad9a

feat: include commit ID in language stats cache

4e61033

fix: rework caching to make tests pass

a4bd935

chore: fix test

2f7d5c4

bahrmichael added 2 commits June 28, 2024 09:57

fix: remove dangling parameters

75c1eb5

fix: remove dangling parameters

b648fec

eseliger approved these changes Jul 1, 2024

fkling requested a review from camdencheek July 3, 2024 09:23

bahrmichael and others added 7 commits July 15, 2024 14:33

chore: rewrite tar processing to cacheable approach

14aa1b1

chore: call Sum only on directory compression

e8b2c09

chore: log warning when we're refusing to compute the inventory

a318e28

chore: revert to basic loop without directory caching

03f0618

chore: ci fixes; cleanup; error handling

d84c831

chore: ci fixes; move timeout into All function

2e7b8a3

Merge branch 'main' into bahrmichael/further-language-stats-improvements

dcd1f2d

bahrmichael requested a review from eseliger July 16, 2024 11:47

chore: cleanup

1f4f861

camdencheek suggested changes Jul 16, 2024

bahrmichael and others added 10 commits July 17, 2024 07:38

Merge branch 'main' into bahrmichael/further-language-stats-improvements

cf74802

chore: bring back context

719c13d

chore: rename test

605d6c4

chore: add test for file reader closing

f8c5be2

chore: drop tar.NewReader wrapper

9c25482

chore: reorganize Inventory fields and drop NewTarReader from interface

d0228c7

chore: move io.NopCloser creation into a function

0c60af5

chore: move io.NopCloser creation into a function

f0443e5

chore: close ArchiveReader after All completes

2d90cef

chore: drop error handling for missing commitID

83e5ee3

bahrmichael requested a review from camdencheek July 17, 2024 15:04

camdencheek approved these changes Jul 17, 2024

bahrmichael merged commit f61e637 into main Jul 18, 2024

bahrmichael deleted the bahrmichael/further-language-stats-improvements branch July 18, 2024 06:40
