Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

feat(code insights): language stats speed improvements by using archive loading#62946

Merged
bahrmichael merged 50 commits into
mainfrom
bahrmichael/further-language-stats-improvements
Jul 18, 2024
Merged

feat(code insights): language stats speed improvements by using archive loading#62946
bahrmichael merged 50 commits into
mainfrom
bahrmichael/further-language-stats-improvements

Conversation

@bahrmichael

@bahrmichael bahrmichael commented May 28, 2024

Copy link
Copy Markdown
Contributor

We previously improved the performance of Language Stats Insights by introducing parallel requests to gitserver: https://github.com/sourcegraph/sourcegraph/pull/62011

This PR replaces the previous approach where we would iterate through and request each file from gitserver with an approach where we request just one archive. This eliminates a lot of network traffic, and gives us an additional(!) performance improvement of 70-90%.

Even repositories like chromium (42GB) can now be processed (on my machine in just one minute).


Caching: We dropped most of the caching, and kept only the top-level caching (repo@commit). This means that we only need to compute the language stats once per commit, and subsequent users/requests can see the cached data. We dropped the file/directory level caching, because (1) the code to do that got very complex and (2) we can assume that most repositories are able to compute within the 5 minutes timeout (which can be increase via the environment variable GET_INVENTORY_TIMEOUT). The timeout is not bound to the user's request anymore. Before, the frontend would request the stats up to three times to let the computation move forward and pick up where the last request aborted. While we still have this frontend retry mechanism, we don't have to worry about an abort-and-continue mechanism in the backend.


Credits for the code to @eseliger: https://github.com/sourcegraph/sourcegraph/issues/62019#issuecomment-2119278481

I've taken the diff, and updated the caching methods to allow for more advanced use cases should we decide to introduce more caching. We can take that out again if the current caching is sufficient.

Todos:

  • Check if CI passes, manual testing seems to be fine
  • Verify that insights are cached at the top level

Test data:

  • sourcegraph/sourcegraph: 9.07s (main) -> 1.44s (current): 74% better
  • facebook/react: 17.52s (main) -> 0.87s (current): 95% better
  • godotengine/godot: 28.92s (main) -> 1.98s (current): 93% better
  • chromium/chromium: ~1 minute: 100% better, because it didn't compute before

Changelog

  • Language stats queries now request one archive from gitserver instead of individual file requests. This leads to a huge performance improvement. Even extra large repositories like chromium are now able to compute within one minute. Previously they timed out.

Test plan

  • New unit tests
  • Plenty of manual testing

@cla-bot cla-bot Bot added the cla-signed label May 28, 2024
@bahrmichael

bahrmichael commented May 29, 2024

Copy link
Copy Markdown
Contributor Author

Currently two problems, the pax_global_header is probably from my changes in https://github.com/sourcegraph/sourcegraph/pull/62946/commits/ec96924a738785a1316e83d27bac8d64797f9f08. Seems like we're missing e2e tests here.

Screenshot 2024-05-29 at 16 30 00

Screenshot 2024-05-29 at 16 29 49

@eseliger

Copy link
Copy Markdown
Member

for the pax_global_header you might be able to get around that issue by adding a fd.IsRegular() check. It's a magical file that is unfortunately added to tar archives :|

@bahrmichael

Copy link
Copy Markdown
Contributor Author

for the pax_global_header you might be able to get around that issue by adding a fd.IsRegular() check. It's a magical file that is unfortunately added to tar archives :|

Thank you for the idea! I found that my changes to the reader have seemed to cause this, which I reverted in https://github.com/sourcegraph/sourcegraph/pull/62946/commits/fb2536ec3cb74a5c4ab6be7e55c08a0c18bf1066.

I'll double check the impact of pax_global_header so we don't report false numbers.

@bahrmichael

bahrmichael commented May 30, 2024

Copy link
Copy Markdown
Contributor Author

The pax_global_header is not counting towards the language stats, because it's skipped at https://github.com/sourcegraph/sourcegraph/blob/0e2a668954f2f8169da51224902865686fa8e472/cmd/frontend/internal/inventory/inventory.go#L89

@eseliger

Copy link
Copy Markdown
Member

great!

@peterguy peterguy requested a review from camdencheek May 30, 2024 17:35
@bahrmichael bahrmichael marked this pull request as ready for review May 31, 2024 07:52
@bahrmichael bahrmichael requested a review from eseliger May 31, 2024 07:56
@bahrmichael bahrmichael changed the title feat: language stats speed improvements by using archive loading feat(code insights): language stats speed improvements by using archive loading May 31, 2024
Comment on lines 65 to 68
if rc == nil || skipEnhancedLanguageDetection {
lang.Name = matchedLang
lang.TotalBytes = uint64(file.Size())
return lang, nil

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we save the call to getFileReader here at all, if skipEnhancedLanguageDetection?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can drop the nil check and move the getFileReader call downwards. Previously nil indicated that the language check should be skipped, but we now have that as a bool.

Previous nil init (see deleted code in this PR):

	if !useEnhancedLanguageDetection && !forceEnhancedLanguageDetection {
		// If USE_ENHANCED_LANGUAGE_DETECTION is disabled, do not read file contents to determine
		// the language. This means we won't calculate the number of lines per language.
		invCtx.NewFileReader = func(ctx context.Context, path string) (io.ReadCloser, error) {
			return nil, nil
		}
	}

New init of according bool:

ShouldSkipEnhancedLanguageDetection: !useEnhancedLanguageDetection && !forceEnhancedLanguageDetection,

Comment thread cmd/frontend/internal/inventory/entries.go Outdated
Comment thread cmd/frontend/internal/inventory/entries.go
Comment thread cmd/frontend/internal/inventory/entries.go
Comment on lines +53 to +56
case entry.Mode().IsDir():
// If we want to we could try to optimize cache invalidation at the tree level
// here. For now, we only iterate over all the files in the archive.
continue

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we had that before, but this is the code path that would allow for incremental indexing. We can leave that out for now, but should monitor the latency and load this causes for large repos (i.e., let's check on large cloud customers after the release).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the end of May I gave it a try to cache based on the directory, and found that it's a bit harder than before. This is mainly because now we get a flat list of records, files and directories in them.

I don't recall exactly if the directory came before the files or after, and am not sure if the order of files and directories is guaranteed. Can you confirm that @eseliger?

I had two ideas: (1) Remember the directory, process files, and when a new directory appears, cache the above. Discarded that again because it felt too hacky. (2) Load everything into a tree structure and process like before, but then we'll need to load everything and actually lose the benefit directory level caching would bring us.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall exactly if the directory came before the files or after, and am not sure if the order of files and directories is guaranteed. Can you confirm that @eseliger?

I think the order should be guaranteed within one commit, but might differ between commits. No 100% guarantee - I'd need to check in the git code to be sure. How important is that? I don't think that's documented behavior of the API today, so would be something implicit we rely on.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Within one commit is fine. I tried to use the folder as an open or close marker to its files. To determine when all files in a folder have been read and caching that before the whole process completes, the guaranteed order is essential.

Here's what I recall seeing:

...
folder1
folder1/a
folder2/a
folder2/b
folder2/c
folder2
folder3/a
...

If the enclosing folder always comes before or after the files, we can use the leading or trailing folder to determine when all files have been read. That's all the guarantees I'm looking for here. Either before or after, but always right before or right after all of its files.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is starting to sound like a leetcode problem 🙂

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like git will always print the directory node first, then go into that tree, and so forth (depth first).

So it would be

folder (d)
folder/file
folder/nested (d)
folder/nested/file
folder/otherfile
rootfile

@eseliger eseliger left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few last comments and Camden should also review, but from a Source perspective, this seems fine

Comment thread cmd/frontend/backend/inventory.go
Comment thread cmd/frontend/backend/repos.go Outdated
Comment thread cmd/frontend/internal/inventory/entries.go
Comment on lines +53 to +56
case entry.Mode().IsDir():
// If we want to we could try to optimize cache invalidation at the tree level
// here. For now, we only iterate over all the files in the archive.
continue

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like git will always print the directory node first, then go into that tree, and so forth (depth first).

So it would be

folder (d)
folder/file
folder/nested (d)
folder/nested/file
folder/otherfile
rootfile

@fkling fkling requested a review from camdencheek July 3, 2024 09:23
@bahrmichael

Copy link
Copy Markdown
Contributor Author

FYI: I'm currently focussing on the batch changes work with Bolaji, and will continue to work on this afterwards. Probably by the end of the week.

@bahrmichael

Copy link
Copy Markdown
Contributor Author

@eseliger @camdencheek I completed another pass, and the PR is ready for you to review again. The major change is that I reworked the inventory code to be simpler to test, then added some complicated caching logic, and with @camdencheek decided to drop that caching again. This means the code is now just a bit simpler to test, and doesn't cache except at the repo-level. Let me know what you think!

@bahrmichael bahrmichael requested a review from eseliger July 16, 2024 11:47

@camdencheek camdencheek left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there! Just a couple of quick things before I think this is ready to go

Comment thread cmd/frontend/backend/inventory.go Outdated
Comment thread cmd/frontend/backend/inventory.go Outdated
Comment thread cmd/frontend/graphqlbackend/git_commit.go Outdated
switch {
case entry.Mode().IsRegular():
lang, err := getLang(ctx, entry, func(ctx context.Context, path string) (io.ReadCloser, error) {
return n.FileReader, nil

@camdencheek camdencheek Jul 16, 2024

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(blocking)
When is FileReader closed?

@bahrmichael bahrmichael Jul 17, 2024

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens in inventory.go with a defer right after the function to get the file reader is called.

	rc, err := getFileReader(ctx, file.Name())
	if err != nil {
		return Lang{}, errors.Wrap(err, "Failed to create a file reader.")
	}
	if rc != nil {
		defer rc.Close()
	}

With https://github.com/sourcegraph/sourcegraph/pull/62946/commits/0c60af57a2f9536a910089e04c17b028a27b36c0 and https://github.com/sourcegraph/sourcegraph/pull/62946/commits/f0443e5d19768717b462d0982e5ad930a6e45906 I updated the code so that the file io.NopCloser (n.FileReader) is only created when the getFileReader function is called.

With https://github.com/sourcegraph/sourcegraph/commit/f8c5be275b9ededc64df577dadfa874628052fb3 I added unit tests to make sure that the file reader is closed after getLang access the function.

I feel like I have a gap when it comes to understanding how and when io readers are opened/closed, and what the impact of un-closed readers in golang are. Will try to read up on that, but if there are some dangerous that you'd like to call out, please don't hesitate! :)

@bahrmichael bahrmichael Jul 17, 2024

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see, it's just the general io problems with leaks when you don't close readers.

In the All code we have an ArchiveReader (which now gets closed with https://github.com/sourcegraph/sourcegraph/pull/62946/commits/2d90cef9cacbc557b32c57d6e32da2ac8fc489b7) that gets handed to various io.NopClosers. I think with the ArchiveReader's closing and the other calls to close, we should be safe.

Is there any automatic mechanism in our stack to close all remaining sockets if the request that opened them is done, i.e. garbage collection of handlers and sockets?

Comment thread cmd/frontend/internal/inventory/entries.go Outdated
@bahrmichael bahrmichael requested a review from camdencheek July 17, 2024 15:04

@camdencheek camdencheek left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great -- thank you!

@bahrmichael bahrmichael merged commit f61e637 into main Jul 18, 2024
@bahrmichael bahrmichael deleted the bahrmichael/further-language-stats-improvements branch July 18, 2024 06:40
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants