git: improve performance of remote ref listing#4176
Conversation
git/git.go
Outdated
| func parseShowRefLine(refPrefix, line string) (sha, name string, ok bool) { | ||
| const shaLen = 40 | ||
|
|
||
| refIdx := strings.Index(line, refPrefix) |
There was a problem hiding this comment.
I used strings.Index here because the original regexp allowed one or more spaces after the SHA. However, the output from show-ref and ls-remote seems well-defined, with a single space or a single tab respectively.
If supporting variations on the format isn't necessary, these two new parse functions might be simplified by looking for the ref at a specific offset in the string.
There was a problem hiding this comment.
I think the way I'd like to go here is to search for the first space or tab, as appropriate, and not use a hard-coded offset. Git will learn to support SHA-256, which has a longer length, very soon (within the next two releases) and I have a series for Git LFS which removes all of the hard-coded 40 constants here, as well as updating regexes to support it here as well. I'd like to avoid regressing that series if possible, so adding more assumptions about the length of the hash should be avoided.
It's probably safe to rely on the output of Git not changing to add additional spaces or tabs that aren't in the current version.
There was a problem hiding this comment.
That makes sense, I updated both functions to look for the space/tab separator.
50fda14 to
b9a5237
Compare
bk2204
left a comment
There was a problem hiding this comment.
As mentioned, I'd like to see fewer assumptions about the length of a hash, since Git will learn about SHA-256 in an upcoming version and we'll need to remove those assumptions soon anyway.
git/git.go
Outdated
| func parseShowRefLine(refPrefix, line string) (sha, name string, ok bool) { | ||
| const shaLen = 40 | ||
|
|
||
| refIdx := strings.Index(line, refPrefix) |
There was a problem hiding this comment.
I think the way I'd like to go here is to search for the first space or tab, as appropriate, and not use a hard-coded offset. Git will learn to support SHA-256, which has a longer length, very soon (within the next two releases) and I have a series for Git LFS which removes all of the hard-coded 40 constants here, as well as updating regexes to support it here as well. I'd like to avoid regressing that series if possible, so adding more assumptions about the length of the hash should be avoided.
It's probably safe to rely on the output of Git not changing to add additional spaces or tabs that aren't in the current version.
When repositories have a large number (hundreds of thousands) of refs, Git LFS adds noticeable overhead on the time taken by `git show-ref` and `git ls-remote`. To reduce this overhead: * Avoid regular expressions when parsing lines of Git command output. Regexp evaluation was the most expensive part of this process in profiling. * Minimize the number of loops when computing skipped refs by combining the missing and skipped checks and converting the list of remote refs into a pre-filtered set. In a repository with ~300k remote references, these changes reduce the time spent in the pre-push command from 8.26s to 6.31s.
b9a5237 to
5a7569a
Compare
|
Thanks for the patch and the prompt turnaround on revising it! |
When we cache files, do so on the full path instead of just the directory entry. This means that when we have an identical file with the same name in two different direectories, we distinguish between the two paths and ensure both are added to .gitattributes. This is an alternate solution to git-lfs#4176 which should perform better. For compmarison, with a clone of Git's main repository with the following command, we get: git lfs migrate import --everything --include="*.h": * v3.0.1 (broken): 608s user, 53s system, 5:34 total * v3.0.2 (fixed): 13435s user, 1255s system, 1:43:17 total * this commit (fixed): 716s user, 67s system, 6:59 total This is a much better performance characteristic for equivalent results. Preserve the integration from the earlier attempt at fixing this plus add an additional one. Avoid using assert_pointer in the new test because that helper doesn't always work correctly when there are two files with the same file name.
When we cache files, do so on the full path instead of just the directory entry. This means that when we have an identical file with the same name in two different direectories, we distinguish between the two paths and ensure both are added to .gitattributes. This is an alternate solution to git-lfs#4176 which should perform better. For compmarison, with a clone of Git's main repository with the following command, we get: git lfs migrate import --everything --include="*.h": * v3.0.1 (broken): 608s user, 53s system, 5:34 total * v3.0.2 (fixed): 13435s user, 1255s system, 1:43:17 total * this commit (fixed): 716s user, 67s system, 6:59 total This is a much better performance characteristic for equivalent results. Preserve the integration from the earlier attempt at fixing this plus add an additional one. Avoid using assert_pointer in the new test because that helper doesn't always work correctly when there are two files with the same file name.
When repositories have a large number (hundreds of thousands) of refs, Git LFS adds noticeable overhead on the time taken by
git show-refandgit ls-remote. To reduce this overhead:In a repository with ~300k remote references, these changes reduce the time spent in the
pre-pushcommand from 8.26s to 6.31s. The individual optimizations were identified and verified with pprof.I'm happy to break these up into two separate PRs or commits if that's preferred.