Skip to content

git: improve performance of remote ref listing#4176

Merged
bk2204 merged 1 commit intogit-lfs:masterfrom
bluekeyes:optimize-ref-listing
Jul 6, 2020
Merged

git: improve performance of remote ref listing#4176
bk2204 merged 1 commit intogit-lfs:masterfrom
bluekeyes:optimize-ref-listing

Conversation

@bluekeyes
Copy link
Contributor

@bluekeyes bluekeyes commented Jun 30, 2020

When repositories have a large number (hundreds of thousands) of refs, Git LFS adds noticeable overhead on the time taken by git show-ref and git ls-remote. To reduce this overhead:

  • Avoid regular expressions when parsing lines of Git command output. Regexp evaluation was the most expensive part of this process in profiling.
  • Minimize the number of loops when computing skipped refs by combining the missing and skipped checks and converting the list of remote refs into a pre-filtered set.

In a repository with ~300k remote references, these changes reduce the time spent in the pre-push command from 8.26s to 6.31s. The individual optimizations were identified and verified with pprof.

I'm happy to break these up into two separate PRs or commits if that's preferred.

git/git.go Outdated
func parseShowRefLine(refPrefix, line string) (sha, name string, ok bool) {
const shaLen = 40

refIdx := strings.Index(line, refPrefix)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used strings.Index here because the original regexp allowed one or more spaces after the SHA. However, the output from show-ref and ls-remote seems well-defined, with a single space or a single tab respectively.

If supporting variations on the format isn't necessary, these two new parse functions might be simplified by looking for the ref at a specific offset in the string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way I'd like to go here is to search for the first space or tab, as appropriate, and not use a hard-coded offset. Git will learn to support SHA-256, which has a longer length, very soon (within the next two releases) and I have a series for Git LFS which removes all of the hard-coded 40 constants here, as well as updating regexes to support it here as well. I'd like to avoid regressing that series if possible, so adding more assumptions about the length of the hash should be avoided.

It's probably safe to rely on the output of Git not changing to add additional spaces or tabs that aren't in the current version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, I updated both functions to look for the space/tab separator.

@bluekeyes bluekeyes force-pushed the optimize-ref-listing branch from 50fda14 to b9a5237 Compare July 1, 2020 00:30
Copy link
Member

@bk2204 bk2204 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, I'd like to see fewer assumptions about the length of a hash, since Git will learn about SHA-256 in an upcoming version and we'll need to remove those assumptions soon anyway.

git/git.go Outdated
func parseShowRefLine(refPrefix, line string) (sha, name string, ok bool) {
const shaLen = 40

refIdx := strings.Index(line, refPrefix)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way I'd like to go here is to search for the first space or tab, as appropriate, and not use a hard-coded offset. Git will learn to support SHA-256, which has a longer length, very soon (within the next two releases) and I have a series for Git LFS which removes all of the hard-coded 40 constants here, as well as updating regexes to support it here as well. I'd like to avoid regressing that series if possible, so adding more assumptions about the length of the hash should be avoided.

It's probably safe to rely on the output of Git not changing to add additional spaces or tabs that aren't in the current version.

When repositories have a large number (hundreds of thousands) of refs, Git
LFS adds noticeable overhead on the time taken by `git show-ref` and
`git ls-remote`. To reduce this overhead:

* Avoid regular expressions when parsing lines of Git command output.
  Regexp evaluation was the most expensive part of this process in
  profiling.

* Minimize the number of loops when computing skipped refs by combining
  the missing and skipped checks and converting the list of remote refs
  into a pre-filtered set.

In a repository with ~300k remote references, these changes reduce the
time spent in the pre-push command from 8.26s to 6.31s.
@bluekeyes bluekeyes force-pushed the optimize-ref-listing branch from b9a5237 to 5a7569a Compare July 2, 2020 05:52
@bk2204 bk2204 merged commit 99c45ab into git-lfs:master Jul 6, 2020
@bk2204
Copy link
Member

bk2204 commented Jul 6, 2020

Thanks for the patch and the prompt turnaround on revising it!

@bluekeyes bluekeyes deleted the optimize-ref-listing branch July 6, 2020 19:27
bk2204 added a commit to bk2204/git-lfs that referenced this pull request Jan 12, 2022
When we cache files, do so on the full path instead of just the
directory entry.  This means that when we have an identical file with
the same name in two different direectories, we distinguish between the
two paths and ensure both are added to .gitattributes.

This is an alternate solution to git-lfs#4176 which should perform better.  For
compmarison, with a clone of Git's main repository with the following
command, we get:

git lfs migrate import --everything --include="*.h":

* v3.0.1      (broken):  608s user,   53s system,    5:34 total
* v3.0.2      (fixed): 13435s user, 1255s system, 1:43:17 total
* this commit (fixed):   716s user,   67s system,    6:59 total

This is a much better performance characteristic for equivalent results.

Preserve the integration from the earlier attempt at fixing this plus
add an additional one.  Avoid using assert_pointer in the new test
because that helper doesn't always work correctly when there are two
files with the same file name.
bk2204 added a commit to bk2204/git-lfs that referenced this pull request Jan 13, 2022
When we cache files, do so on the full path instead of just the
directory entry.  This means that when we have an identical file with
the same name in two different direectories, we distinguish between the
two paths and ensure both are added to .gitattributes.

This is an alternate solution to git-lfs#4176 which should perform better.  For
compmarison, with a clone of Git's main repository with the following
command, we get:

git lfs migrate import --everything --include="*.h":

* v3.0.1      (broken):  608s user,   53s system,    5:34 total
* v3.0.2      (fixed): 13435s user, 1255s system, 1:43:17 total
* this commit (fixed):   716s user,   67s system,    6:59 total

This is a much better performance characteristic for equivalent results.

Preserve the integration from the earlier attempt at fixing this plus
add an additional one.  Avoid using assert_pointer in the new test
because that helper doesn't always work correctly when there are two
files with the same file name.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants