feat(code insights): language stats speed improvements by using archive loading#62946
Conversation
|
|
|
for the pax_global_header you might be able to get around that issue by adding a |
Thank you for the idea! I found that my changes to the reader have seemed to cause this, which I reverted in https://github.com/sourcegraph/sourcegraph/pull/62946/commits/fb2536ec3cb74a5c4ab6be7e55c08a0c18bf1066. I'll double check the impact of pax_global_header so we don't report false numbers. |
|
The pax_global_header is not counting towards the language stats, because it's skipped at https://github.com/sourcegraph/sourcegraph/blob/0e2a668954f2f8169da51224902865686fa8e472/cmd/frontend/internal/inventory/inventory.go#L89 |
|
great! |
| if rc == nil || skipEnhancedLanguageDetection { | ||
| lang.Name = matchedLang | ||
| lang.TotalBytes = uint64(file.Size()) | ||
| return lang, nil |
There was a problem hiding this comment.
could we save the call to getFileReader here at all, if skipEnhancedLanguageDetection?
There was a problem hiding this comment.
I think we can drop the nil check and move the getFileReader call downwards. Previously nil indicated that the language check should be skipped, but we now have that as a bool.
Previous nil init (see deleted code in this PR):
if !useEnhancedLanguageDetection && !forceEnhancedLanguageDetection {
// If USE_ENHANCED_LANGUAGE_DETECTION is disabled, do not read file contents to determine
// the language. This means we won't calculate the number of lines per language.
invCtx.NewFileReader = func(ctx context.Context, path string) (io.ReadCloser, error) {
return nil, nil
}
}
New init of according bool:
ShouldSkipEnhancedLanguageDetection: !useEnhancedLanguageDetection && !forceEnhancedLanguageDetection,
| case entry.Mode().IsDir(): | ||
| // If we want to we could try to optimize cache invalidation at the tree level | ||
| // here. For now, we only iterate over all the files in the archive. | ||
| continue |
There was a problem hiding this comment.
I'm not sure if we had that before, but this is the code path that would allow for incremental indexing. We can leave that out for now, but should monitor the latency and load this causes for large repos (i.e., let's check on large cloud customers after the release).
There was a problem hiding this comment.
At the end of May I gave it a try to cache based on the directory, and found that it's a bit harder than before. This is mainly because now we get a flat list of records, files and directories in them.
I don't recall exactly if the directory came before the files or after, and am not sure if the order of files and directories is guaranteed. Can you confirm that @eseliger?
I had two ideas: (1) Remember the directory, process files, and when a new directory appears, cache the above. Discarded that again because it felt too hacky. (2) Load everything into a tree structure and process like before, but then we'll need to load everything and actually lose the benefit directory level caching would bring us.
There was a problem hiding this comment.
I don't recall exactly if the directory came before the files or after, and am not sure if the order of files and directories is guaranteed. Can you confirm that @eseliger?
I think the order should be guaranteed within one commit, but might differ between commits. No 100% guarantee - I'd need to check in the git code to be sure. How important is that? I don't think that's documented behavior of the API today, so would be something implicit we rely on.
There was a problem hiding this comment.
Within one commit is fine. I tried to use the folder as an open or close marker to its files. To determine when all files in a folder have been read and caching that before the whole process completes, the guaranteed order is essential.
Here's what I recall seeing:
...
folder1
folder1/a
folder2/a
folder2/b
folder2/c
folder2
folder3/a
...
If the enclosing folder always comes before or after the files, we can use the leading or trailing folder to determine when all files have been read. That's all the guarantees I'm looking for here. Either before or after, but always right before or right after all of its files.
There was a problem hiding this comment.
This is starting to sound like a leetcode problem 🙂
There was a problem hiding this comment.
It looks like git will always print the directory node first, then go into that tree, and so forth (depth first).
So it would be
folder (d)
folder/file
folder/nested (d)
folder/nested/file
folder/otherfile
rootfile
eseliger
left a comment
There was a problem hiding this comment.
A few last comments and Camden should also review, but from a Source perspective, this seems fine
| case entry.Mode().IsDir(): | ||
| // If we want to we could try to optimize cache invalidation at the tree level | ||
| // here. For now, we only iterate over all the files in the archive. | ||
| continue |
There was a problem hiding this comment.
It looks like git will always print the directory node first, then go into that tree, and so forth (depth first).
So it would be
folder (d)
folder/file
folder/nested (d)
folder/nested/file
folder/otherfile
rootfile
|
FYI: I'm currently focussing on the batch changes work with Bolaji, and will continue to work on this afterwards. Probably by the end of the week. |
|
@eseliger @camdencheek I completed another pass, and the PR is ready for you to review again. The major change is that I reworked the inventory code to be simpler to test, then added some complicated caching logic, and with @camdencheek decided to drop that caching again. This means the code is now just a bit simpler to test, and doesn't cache except at the repo-level. Let me know what you think! |
camdencheek
left a comment
There was a problem hiding this comment.
Almost there! Just a couple of quick things before I think this is ready to go
| switch { | ||
| case entry.Mode().IsRegular(): | ||
| lang, err := getLang(ctx, entry, func(ctx context.Context, path string) (io.ReadCloser, error) { | ||
| return n.FileReader, nil |
There was a problem hiding this comment.
(blocking)
When is FileReader closed?
There was a problem hiding this comment.
This happens in inventory.go with a defer right after the function to get the file reader is called.
rc, err := getFileReader(ctx, file.Name())
if err != nil {
return Lang{}, errors.Wrap(err, "Failed to create a file reader.")
}
if rc != nil {
defer rc.Close()
}
With https://github.com/sourcegraph/sourcegraph/pull/62946/commits/0c60af57a2f9536a910089e04c17b028a27b36c0 and https://github.com/sourcegraph/sourcegraph/pull/62946/commits/f0443e5d19768717b462d0982e5ad930a6e45906 I updated the code so that the file io.NopCloser (n.FileReader) is only created when the getFileReader function is called.
With https://github.com/sourcegraph/sourcegraph/commit/f8c5be275b9ededc64df577dadfa874628052fb3 I added unit tests to make sure that the file reader is closed after getLang access the function.
I feel like I have a gap when it comes to understanding how and when io readers are opened/closed, and what the impact of un-closed readers in golang are. Will try to read up on that, but if there are some dangerous that you'd like to call out, please don't hesitate! :)
There was a problem hiding this comment.
As far as I can see, it's just the general io problems with leaks when you don't close readers.
In the All code we have an ArchiveReader (which now gets closed with https://github.com/sourcegraph/sourcegraph/pull/62946/commits/2d90cef9cacbc557b32c57d6e32da2ac8fc489b7) that gets handed to various io.NopClosers. I think with the ArchiveReader's closing and the other calls to close, we should be safe.
Is there any automatic mechanism in our stack to close all remaining sockets if the request that opened them is done, i.e. garbage collection of handlers and sockets?
camdencheek
left a comment
There was a problem hiding this comment.
Looks great -- thank you!
We previously improved the performance of Language Stats Insights by introducing parallel requests to gitserver: https://github.com/sourcegraph/sourcegraph/pull/62011
This PR replaces the previous approach where we would iterate through and request each file from gitserver with an approach where we request just one archive. This eliminates a lot of network traffic, and gives us an additional(!) performance improvement of 70-90%.
Even repositories like chromium (42GB) can now be processed (on my machine in just one minute).
Caching: We dropped most of the caching, and kept only the top-level caching (repo@commit). This means that we only need to compute the language stats once per commit, and subsequent users/requests can see the cached data. We dropped the file/directory level caching, because (1) the code to do that got very complex and (2) we can assume that most repositories are able to compute within the 5 minutes timeout (which can be increase via the environment variable
GET_INVENTORY_TIMEOUT). The timeout is not bound to the user's request anymore. Before, the frontend would request the stats up to three times to let the computation move forward and pick up where the last request aborted. While we still have this frontend retry mechanism, we don't have to worry about an abort-and-continue mechanism in the backend.Credits for the code to @eseliger: https://github.com/sourcegraph/sourcegraph/issues/62019#issuecomment-2119278481
I've taken the diff, and updated the caching methods to allow for more advanced use cases should we decide to introduce more caching. We can take that out again if the current caching is sufficient.
Todos:
Test data:
Changelog
Test plan