Skip to content

Reworked to use git ls-files in some circumstances instead of FastWalkGitRepo#3823

Merged
bk2204 merged 2 commits into
git-lfs:masterfrom
sudonym1:master
Oct 1, 2019
Merged

Reworked to use git ls-files in some circumstances instead of FastWalkGitRepo#3823
bk2204 merged 2 commits into
git-lfs:masterfrom
sudonym1:master

Conversation

@sudonym1

@sudonym1 sudonym1 commented Sep 17, 2019

Copy link
Copy Markdown

This is a rough cut of addressing #3797 by replacing FastWalkGitRepo with git ls-files. For my use case, it cuts post-hook execution time down from ~40 seconds, to a few hundred ms.

At this point, I am looking for a high-level review of the changes, and direction on wether I should proceed w/ completing this change.

There are a few things that need to be addressed:

  1. The new code needs tests.
  2. Some existing tests are failing.
  3. The new behavior needs to be verified against the old behavior on more repos.

I am also requesting any feedback that anyone familiar w/ the code might have. As it stands, git-lfs is unusable for our project without this (or a similar) fix, and I would like to verify that this fix is sound before we use it internally.

@sudonym1 sudonym1 changed the title Reworked to us git ls-files in some circumstances Reworked to use git ls-files in some circumstances instead of FastWalkGitRepo Sep 17, 2019
@bk2204

bk2204 commented Sep 17, 2019

Copy link
Copy Markdown
Member

The approach I'm taking, which I'm still waiting on feedback on, is the ls-files-filter branch on bk2204/git-lfs. It does pass the testsuite, although it has the limitations that it doesn't handle the case where users lock ignored files and it doesn't handle files with newlines (which we don't handle anyway at the moment).

The general approach of using git ls-files is definitely sound, but I'm not sure how it performs on large repos. I'm not opposed to taking a patch that you write up if you like, although if you're happy with my patch, it may be easier for me to rework and submit.

@sudonym1

Copy link
Copy Markdown
Author

RE large repos: I think the limiting factor is going to be the size of the working tree, and the complexity of the gitignores. To that end, our repo is the result of an SVN conversion, and it is fairly large at 278,000 files, with a gitignore over 3k lines.

I will take a look at your changes, and see where we differ. If you are open to accepting a PR, I will keep working on this.

@sudonym1

sudonym1 commented Sep 17, 2019

Copy link
Copy Markdown
Author

I spent a minute and reviewed your change. It looks pretty similar to mine, except you are hitting the filesystem a little more.

I think I need to understand a little more:

  1. When searching for .gitattributes: What are the requirements here? AFAIU, we should be looking for all git attributes files that are either staged, or committed to the repository. Is that correct? What about .gitattributes files that have been created but not yet staged?
  2. Fixing up file locks: Does a file have to be staged/commited to be "locked"? If so, could the calls to stat be eliminated? For reference, it is about 10x faster if ls-files doesn't have to traverse the work tree.
  3. In your change, you are now stepping through the output of ls-files one at at time and calling the callbacks sequentially. What do you suppose the performance impact of this will be on querying the server for file locks?

@bk2204

bk2204 commented Sep 17, 2019

Copy link
Copy Markdown
Member

For 1, we need to honor .gitattributes files that are in the tree but not yet staged. We have tests that check for this case and require it to work.

For 2, it isn't practically useful to lock a file that isn't yet committed, since nobody else will be contending over it. I don't think we have to support that case.

For 3, I don't believe we're querying the server if you're talking about the code in locking/lockable.go. There, we're just looking at the .gitattributes file for the lockable attribute and setting those files read only.

I don't think we intrinsically need to call stat in this case. I can imagine an additional commit on top of my change that instead of passing an os.FileInfo, changes the code to never run the callback on directories and instead passes just the name of the file in place of that argument. I don't believe we actually care about the directories in any case where we traverse and it would be cheaper, as you point out.

@sudonym1

Copy link
Copy Markdown
Author

I ran some trivial bench marking on the two change sets:

# With your change set, including all of the calls to stat.
$ time git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.

real    0m6.902s
user    0m2.746s
sys     0m2.729s

# With my change set
$ time git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.

real    0m0.673s
user    0m0.716s
sys     0m0.553s

My guess is that the majority of the difference is the calls to stat. I will rework my change so the test cases pass. My guess is it will end up looking very similar to your change, except for the calls to stat, and the NUL separated output from git ls-files.

@sudonym1 sudonym1 force-pushed the master branch 2 times, most recently from 14ade2b to d542f07 Compare September 18, 2019 00:40
Due to the complexities of handling .gitignores, rely directly on
git-ls-files rather than a custom gitignore-aware directory walking
routine.

As a side-effect, there has been a minor behaviorial change related to
locking. Previously, if a file was locked, then a matching pattern was
added to a .gitignore, that file was no longer considered locked and the
writeable attribute was set on that file. With this change, files which
match a .gitignore, but have been addded to the repository,
will still have their write bit cleared.

Fixes git-lfs#3797
@sudonym1

Copy link
Copy Markdown
Author

Updated benchmark:

$ time git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.

real    0m1.741s
user    0m1.568s
sys     0m0.692s

As expected, it is faster than running stat on all of the files, but not as fast as the initial version because it has to scan the working directory (git ls-files --other).

@sudonym1

Copy link
Copy Markdown
Author

Not sure why the windows ci build is failing. Is that expected?

@slonopotamus

Copy link
Copy Markdown
Contributor

I believe there's something broken with Windows CI. Reported as #3828.

@slonopotamus

Copy link
Copy Markdown
Contributor

@SeamusConnor Please, rebase onto latest master to make sure tests pass on Windows.

@sudonym1

Copy link
Copy Markdown
Author

Alright, looks like whatever was done to the windows CI build on master fixed my issue.

@bk2204 bk2204 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, sorry it's taken me so long to get to this.

Overall, this looks like a great improvement. I have some concerns about the memory usage with the technique implemented here, especially on large repositories. I think using a goroutine approach to return items incrementally may be a better approach.

Comment thread git/ls_files.go
FullPath: scanner.Text(),
}
rv.Files[scanner.Text()] = finfo
rv.FilesByName[base] = append(rv.FilesByName[base], finfo)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this approach has the real possibility of using a lot of memory on large repositories, since we're storing data twice for each file instead of operating incrementally. I think we'd want to continue to do things incrementally like we do for the current walker.

@sudonym1 sudonym1 Oct 1, 2019

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I considered that when I did it. For my repo (which IMO is quite large w/ a working tree of ~300k files), ls-files -z -o --cached produced around 20 MiB of data. I made a completely untested tradeoff for speed here, with the assumption that a callback or channel based approach would be slower. Perhaps the most prudent thing to do would be for me to try it with an iterative interface, and see what happens to the execution time/memory consumption?
Anyway, that was the thought process. I could go either way. Just let me know.

@slonopotamus slonopotamus Oct 1, 2019

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Largest git installation I worked on was 1M files with 1.5TB .git/objects (before git-lfs existed) and this was much beyond what any sane person would do. So if what you did works reasonably well for 300K files, I believe that's Good Enough to leave as is.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, let's go with it. It's going to scale better than the current status quo, and it's not too difficult to fix if it doesn't.

@bk2204 bk2204 merged commit f45574a into git-lfs:master Oct 1, 2019
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Mar 16, 2026
In commit 13a8af6 of PR git-lfs#1616 we
added a FastWalkGitRepo() function to our "tools" package for the
purpose of improving the performance of the "git lfs track" command.
At the time, this command used the Walk() function of the "filepath"
package from the Go standard library to traverse the contents of the
current Git working tree and locate all ".gitattributes" files.

Then in commit f1fdc85 of the same
PR git-lfs#1616 we updated the "git lfs track" command's findAttributeFiles()
function to use our new FastWalkGitRepo() function instead of the
"filepath" package's Walk() function.  This change made searches for
".gitattributes" files in large repositories faster for several reasons.
Unlike the Walk() function from the "filepath" package, the functions
called by the FastWalkGitRepo() function to traverse a directory
hierarchy did not sort the entries in each directory, and also ignored
all ".git" directories and all entries which matched any pattern found
in a ".gitignore" file.

Later, in PRs git-lfs#1870 and git-lfs#2689, we expanded the number of callers of
the FastWalkGitRepo() function.  In particular, in PR git-lfs#2689 we began
to make use of the function during the final phase of all Git LFS
commands where we find and delete any stale temporary files stored in
our ".lfs/tmp" directory.  This PR introduced our "fs" package whose
cleanupTmp() method calls the FastWalkGitRepo() function, passing
the path to the ".lfs/tmp" directory and an anonymous callback function
which removes any temporary object data files that are more than an
hour old.

We then added a FastWalkGitRepoAll() function to our "fs" package in
PR git-lfs#3190, which operated in a similar fashion as the FastWalkGitRepo()
function but did not read ".gitignore" files and so also did not skip
directory entries matching any patterns found in those files.

Next, in PR git-lfs#3686, we updated the internal implementation of the
FastWalkGitRepo() and FastWalkGitRepoAll() functions to avoid entering
submodules when traversing a Git working tree.  To make this change,
we added a check to the Walk() method of the "fastWalker" structure
so it would return immediately when processing a directory if the
directory contained an entry named ".git", unless the directory was
the root of the working tree.  Note that the Walk() method already
ignored any directory entries with the name ".git", but this only
meant it would traverse through the contents of a submodule checkout
in a working tree while skipping the submodule's ".git" directory.

Then in commit 83d7f76 of PR git-lfs#3823
we first implemented the NewLsFiles() function in our "git" package,
which executes a "git ls-files" command and returns list of files it
outputs.  As well, we updated the findAttributeFiles() function of our
"git" package and the fixFileWriteFlags() function in our "locking"
package to both call the NewLsFiles() function instead of calling
either the FastWalkGitRepo() or FastWalkGitRepoAll() functions.

Since these were the only instances where we actually used the
FastWalkGitRepo() or FastWalkGitRepoAll() functions to traverse a Git
working tree, in the same commit of PR git-lfs#3823 we removed the
FastWalkGitRepo() function and renamed the FastWalkGitRepoAll() function
to FastWalkDir().  We also simplified some of the internal functions
called by the FastWalkDir() function because they no longer needed to
detect or parse ".gitignore" files, or skip directory entries matching
the patterns from those files.

Although the two remaining use cases for the FastWalkDir() function
also did not need the function to detect and skip submodules or
directories named ".git", the logic to do so was retained in the
internal functions of the "tools" package.

Specifically, the fastWalkWithExcludeFiles() function, which is called
by the FastWalkDir() function, establishes two file path filters with
the patterns ".git" and "**/.git", which are then passed down to the
Walk() method of the "fastWalker" structure.  That method checks whether
the current directory entry's name matches either of the filter patterns,
and if it does returns immediately without processing the entry any
further.  In addition, the Walk() method still also performs the check
added in PR git-lfs#3686 to try to avoid traversing into Git submodules.

As noted above, however, neither of the two remaining callers of the
FastWalkDir() function require these checks, because they only need
the function to traverse directory hierarchies within the ".git/lfs"
directory.

The cleanupTmp() method of the Filesystem structure in our "fs" package
uses the FastWalkDir() function to find stale temporary files within the
".git/lfs/tmp" directory.  The EachObject() method of the same structure,
meanwhile, uses the FastWalkDir() function to invoke a callback function
for each object file under the ".git/lfs/objects" directory.

In both cases, the directory hierarchy traversed by the FastWalkDir()
function is entirely within the ".git/lfs" directory, so there is no
value in trying to exclude ".git" directories while performing the
traversal.

We therefore now simplify the FastWalkDir() function and the internal
functions it invokes by removing the unnecessary checks for submodules
and for directory entries named ".git".

First, we update the fastWalkWithExcludeFiles() function so that does
not initialize any file path filters, and we rename the function to
fastWalkDir().

Next, we remove the Walk() method's "excludePaths" parameter, and alter
the method so that it no longer skips directory entries based on whether
their names match the patterns in that parameter's file path filters.

We also eliminate the check in the Walk() method for directories
containing ".git" directory entries, as this check's only purpose was
to skip Git submodules within a working tree.

As well, we revise the code comments relating to all of these functions
and methods to reflect their new names and their simplified behaviours.
To minimize the changes in this commit, however, we leave the names of
the functions' parameters and internal variables intact, even though
some of them still reflect the original design and the expectation that
the functions would be used with Git working trees.  In a subsequent
commit in this PR we will then rename these variables and parameters,
along with the "rootDir" field of the "fastWalker" structure, so that
they more accurately represent the functions' current purpose and
implementation.

Finally, note that the "cleans only temp files and directories older
than an hour" test in our "t/t-tempfile.sh" shell test script verifies
the behaviour of cleanupTmp() function, which employs the FastWalkDir()
function, while the TestFastWalkBasic() test function in our Go test
suite directly exercises the fastWalkDir() internal function.  Both
tests thus provide some assurance that our changes in this commit have
not introduced any unexpected regressions.
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Mar 16, 2026
In commit 13a8af6 of PR git-lfs#1616 we
added a FastWalkGitRepo() function to our "tools" package for the
purpose of improving the performance of the "git lfs track" command.
At the time, this command used the Walk() function of the "filepath"
package from the Go standard library to traverse the contents of the
current Git working tree and locate all ".gitattributes" files.

Then in commit f1fdc85 of the same
PR git-lfs#1616 we updated the "git lfs track" command's findAttributeFiles()
function to use our new FastWalkGitRepo() function instead of the
"filepath" package's Walk() function.  This change made searches for
".gitattributes" files in large repositories faster for several reasons.
Unlike the Walk() function from the "filepath" package, the functions
called by the FastWalkGitRepo() function to traverse a directory
hierarchy did not sort the entries in each directory, and also ignored
all ".git" directories and all entries which matched any pattern found
in a ".gitignore" file.

Later, in PRs git-lfs#1870 and git-lfs#2689, we expanded the number of callers of
the FastWalkGitRepo() function.  In particular, in PR git-lfs#2689 we began
to make use of the function during the final phase of all Git LFS
commands where we find and delete any stale temporary files stored in
our ".lfs/tmp" directory.  This PR introduced our "fs" package whose
cleanupTmp() method calls the FastWalkGitRepo() function, passing
the path to the ".lfs/tmp" directory and an anonymous callback function
which removes any temporary object data files that are more than an
hour old.

We then added a FastWalkGitRepoAll() function to our "fs" package in
PR git-lfs#3190, which operated in a similar fashion as the FastWalkGitRepo()
function but did not read ".gitignore" files and so also did not skip
directory entries matching any patterns found in those files.

Next, in PR git-lfs#3686, we updated the internal implementation of the
FastWalkGitRepo() and FastWalkGitRepoAll() functions to avoid entering
submodules when traversing a Git working tree.  To make this change,
we added a check to the Walk() method of the "fastWalker" structure
so it would return immediately when processing a directory if the
directory contained an entry named ".git", unless the directory was
the root of the working tree.  Note that the Walk() method already
ignored any directory entries with the name ".git", but this only
meant it would traverse through the contents of a submodule checkout
in a working tree while skipping the submodule's ".git" directory.

Then in commit 83d7f76 of PR git-lfs#3823
we first implemented the NewLsFiles() function in our "git" package,
which executes a "git ls-files" command and returns list of files it
outputs.  As well, we updated the findAttributeFiles() function of our
"git" package and the fixFileWriteFlags() function in our "locking"
package to both call the NewLsFiles() function instead of calling
either the FastWalkGitRepo() or FastWalkGitRepoAll() functions.

Since these were the only instances where we actually used the
FastWalkGitRepo() or FastWalkGitRepoAll() functions to traverse a Git
working tree, in the same commit of PR git-lfs#3823 we removed the
FastWalkGitRepo() function and renamed the FastWalkGitRepoAll() function
to FastWalkDir().  We also simplified some of the internal functions
called by the FastWalkDir() function because they no longer needed to
detect or parse ".gitignore" files, or skip directory entries matching
the patterns from those files.

Although the two remaining use cases for the FastWalkDir() function
also did not need the function to detect and skip submodules or
directories named ".git", the logic to do so was retained in the
internal functions of the "tools" package.

Specifically, the fastWalkWithExcludeFiles() function, which is called
by the FastWalkDir() function, establishes two file path filters with
the patterns ".git" and "**/.git", which are then passed down to the
Walk() method of the "fastWalker" structure.  That method checks whether
the current directory entry's name matches either of the filter patterns,
and if it does returns immediately without processing the entry any
further.  In addition, the Walk() method still also performs the check
added in PR git-lfs#3686 to try to avoid traversing into Git submodules.

As noted above, however, neither of the two remaining callers of the
FastWalkDir() function require these checks, because they only need
the function to traverse directory hierarchies within the ".git/lfs"
directory.

The cleanupTmp() method of the Filesystem structure in our "fs" package
uses the FastWalkDir() function to find stale temporary files within the
".git/lfs/tmp" directory.  The EachObject() method of the same structure,
meanwhile, uses the FastWalkDir() function to invoke a callback function
for each object file under the ".git/lfs/objects" directory.

In both cases, the directory hierarchy traversed by the FastWalkDir()
function is entirely within the ".git/lfs" directory, so there is no
value in trying to exclude ".git" directories while performing the
traversal.

We therefore now simplify the FastWalkDir() function and the internal
functions it invokes by removing the unnecessary checks for submodules
and for directory entries named ".git".

First, we update the fastWalkWithExcludeFiles() function so that does
not initialize any file path filters, and we rename the function to
fastWalkDir().

Next, we remove the Walk() method's "excludePaths" parameter, and alter
the method so that it no longer skips directory entries based on whether
their names match the patterns in that parameter's file path filters.

We also eliminate the check in the Walk() method for directories
containing ".git" directory entries, as this check's only purpose was
to skip Git submodules within a working tree.

As well, we revise the code comments relating to all of these functions
and methods to reflect their new names and their simplified behaviours.
To minimize the changes in this commit, however, we leave the names of
the functions' parameters and internal variables intact, even though
some of them still reflect the original design and the expectation that
the functions would be used with Git working trees.  In a subsequent
commit in this PR we will then rename these variables and parameters,
along with the "rootDir" field of the "fastWalker" structure, so that
they more accurately represent the functions' current purpose and
implementation.

Finally, note that the "cleans only temp files and directories older
than an hour" test in our "t/t-tempfile.sh" shell test script verifies
the behaviour of cleanupTmp() function, which employs the FastWalkDir()
function, while the TestFastWalkBasic() test function in our Go test
suite directly exercises the fastWalkDir() internal function.  Both
tests thus provide some assurance that our changes in this commit have
not introduced any unexpected regressions.
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Mar 17, 2026
In commit 13a8af6 of PR git-lfs#1616 we
added a FastWalkGitRepo() function to our "tools" package for the
purpose of improving the performance of the "git lfs track" command.
At the time, this command used the Walk() function of the "filepath"
package from the Go standard library to traverse the contents of the
current Git working tree and locate all ".gitattributes" files.

Then in commit f1fdc85 of the same
PR git-lfs#1616 we updated the "git lfs track" command's findAttributeFiles()
function to use our new FastWalkGitRepo() function instead of the
"filepath" package's Walk() function.  This change made searches for
".gitattributes" files in large repositories faster for several reasons.
Unlike the Walk() function from the "filepath" package, the functions
called by the FastWalkGitRepo() function to traverse a directory
hierarchy did not sort the entries in each directory, and also ignored
all ".git" directories and all entries which matched any pattern found
in a ".gitignore" file.

Later, in PRs git-lfs#1870 and git-lfs#2689, we expanded the number of callers of
the FastWalkGitRepo() function.  In particular, in PR git-lfs#2689 we began
to make use of the function during the final phase of all Git LFS
commands where we find and delete any stale temporary files stored in
our ".lfs/tmp" directory.  This PR introduced our "fs" package whose
cleanupTmp() method calls the FastWalkGitRepo() function, passing
the path to the ".lfs/tmp" directory and an anonymous callback function
which removes any temporary object data files that are more than an
hour old.

We then added a FastWalkGitRepoAll() function to our "fs" package in
PR git-lfs#3190, which operated in a similar fashion as the FastWalkGitRepo()
function but did not read ".gitignore" files and so also did not skip
directory entries matching any patterns found in those files.

Next, in PR git-lfs#3686, we updated the internal implementation of the
FastWalkGitRepo() and FastWalkGitRepoAll() functions to avoid entering
submodules when traversing a Git working tree.  To make this change,
we added a check to the Walk() method of the "fastWalker" structure
so it would return immediately when processing a directory if the
directory contained an entry named ".git", unless the directory was
the root of the working tree.  Note that the Walk() method already
ignored any directory entries with the name ".git", but this only
meant it would traverse through the contents of a submodule checkout
in a working tree while skipping the submodule's ".git" directory.

Then in commit 83d7f76 of PR git-lfs#3823
we first implemented the NewLsFiles() function in our "git" package,
which executes a "git ls-files" command and returns list of files it
outputs.  As well, we updated the findAttributeFiles() function of our
"git" package and the fixFileWriteFlags() function in our "locking"
package to both call the NewLsFiles() function instead of calling
either the FastWalkGitRepo() or FastWalkGitRepoAll() functions.

Since these were the only instances where we actually used the
FastWalkGitRepo() or FastWalkGitRepoAll() functions to traverse a Git
working tree, in the same commit of PR git-lfs#3823 we removed the
FastWalkGitRepo() function and renamed the FastWalkGitRepoAll() function
to FastWalkDir().  We also simplified some of the internal functions
called by the FastWalkDir() function because they no longer needed to
detect or parse ".gitignore" files, or skip directory entries matching
the patterns from those files.

Although the two remaining use cases for the FastWalkDir() function
also did not need the function to detect and skip submodules or
directories named ".git", the logic to do so was retained in the
internal functions of the "tools" package.

Specifically, the fastWalkWithExcludeFiles() function, which is called
by the FastWalkDir() function, establishes two file path filters with
the patterns ".git" and "**/.git", which are then passed down to the
Walk() method of the "fastWalker" structure.  That method checks whether
the current directory entry's name matches either of the filter patterns,
and if it does returns immediately without processing the entry any
further.  In addition, the Walk() method still also performs the check
added in PR git-lfs#3686 to try to avoid traversing into Git submodules.

As noted above, however, neither of the two remaining callers of the
FastWalkDir() function require these checks, because they only need
the function to traverse directory hierarchies within the ".git/lfs"
directory.

The cleanupTmp() method of the Filesystem structure in our "fs" package
uses the FastWalkDir() function to find stale temporary files within the
".git/lfs/tmp" directory.  The EachObject() method of the same structure,
meanwhile, uses the FastWalkDir() function to invoke a callback function
for each object file under the ".git/lfs/objects" directory.

In both cases, the directory hierarchy traversed by the FastWalkDir()
function is entirely within the ".git/lfs" directory, so there is no
value in trying to exclude ".git" directories while performing the
traversal.

We therefore now simplify the FastWalkDir() function and the internal
functions it invokes by removing the unnecessary checks for submodules
and for directory entries named ".git".

First, we update the fastWalkWithExcludeFiles() function so that does
not initialize any file path filters, and we rename the function to
fastWalkDir().

Next, we remove the Walk() method's "excludePaths" parameter, and alter
the method so that it no longer skips directory entries based on whether
their names match the patterns in that parameter's file path filters.

We also eliminate the check in the Walk() method for directories
containing ".git" directory entries, as this check's only purpose was
to skip Git submodules within a working tree.

As well, we revise the code comments relating to all of these functions
and methods to reflect their new names and their simplified behaviours.
To minimize the changes in this commit, however, we leave the names of
the functions' parameters and internal variables intact, even though
some of them still reflect the original design and the expectation that
the functions would be used with Git working trees.  In a subsequent
commit in this PR we will then rename these variables and parameters,
along with the "rootDir" field of the "fastWalker" structure, so that
they more accurately represent the functions' current purpose and
implementation.

Finally, note that the "cleans only temp files and directories older
than an hour" test in our "t/t-tempfile.sh" shell test script verifies
the behaviour of cleanupTmp() function, which employs the FastWalkDir()
function, while the TestFastWalkBasic() test function in our Go test
suite directly exercises the fastWalkDir() internal function.  Both
tests thus provide some assurance that our changes in this commit have
not introduced any unexpected regressions.
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Mar 17, 2026
In commit 13a8af6 of PR git-lfs#1616 we
added a FastWalkGitRepo() function to our "tools" package for the
purpose of improving the performance of the "git lfs track" command.
At the time, this command used the Walk() function of the "filepath"
package from the Go standard library to traverse the contents of the
current Git working tree and locate all ".gitattributes" files.

Then in commit f1fdc85 of the same
PR git-lfs#1616 we updated the "git lfs track" command's findAttributeFiles()
function to use our new FastWalkGitRepo() function instead of the
"filepath" package's Walk() function.  This change made searches for
".gitattributes" files in large repositories faster for several reasons.
Unlike the Walk() function from the "filepath" package, the functions
called by the FastWalkGitRepo() function to traverse a directory
hierarchy did not sort the entries in each directory, and also ignored
all ".git" directories and all entries which matched any pattern found
in a ".gitignore" file.

Later, in PRs git-lfs#1870 and git-lfs#2689, we expanded the number of callers of
the FastWalkGitRepo() function.  In particular, in PR git-lfs#2689 we began
to make use of the function in our "git lfs prune" command to locate
all the object files in our local storage directories, which by default
are located within the ".git/lfs/objects" directory.  As well, with
this PR we began to use the FastWalkGitRepo() function during the
final phase of all Git LFS commands where we find and delete any
stale temporary files from the directory where we store such files,
which by default is the ".git/lfs/tmp" directory.

PR git-lfs#2689 introduced our "fs" package, whose Filesystem structure's
EachObject() and cleanupTmp() methods both called the FastWalkGitRepo()
function, passing the paths to the local object storage directory
and the local temporary file storage directory, respectively.

We then added a FastWalkGitRepoAll() function to our "fs" package in
PR git-lfs#3190.  This function operated in a similar fashion as the
FastWalkGitRepo() function but did not read ".gitignore" files and so
also did not skip directory entries matching any patterns found in
those files.

Next, in PR git-lfs#3686, we updated the internal implementation of the
FastWalkGitRepo() and FastWalkGitRepoAll() functions to avoid entering
submodules when traversing a Git working tree.  To make this change,
we added a check to the "fastWalker" structure's Walk() method so it
would return immediately when processing a directory if the directory
contained an entry named ".git", unless the directory was the root of
the working tree.  Note that the Walk() method already ignored any
directory entries with the name ".git", but this only meant it would
traverse through the contents of a submodule checkout in a working
tree while skipping the submodule's ".git" directory.

Then in commit 83d7f76 of PR git-lfs#3823
we first implemented the NewLsFiles() function in our "git" package,
which executes a "git ls-files" command and returns the list of files
it outputs.  As well, we updated the findAttributeFiles() function of
our "git" package and the fixFileWriteFlags() function in our "locking"
package to both call the NewLsFiles() function instead of calling
either the FastWalkGitRepo() or FastWalkGitRepoAll() functions.

Since these were the only instances where we actually used the
FastWalkGitRepo() or FastWalkGitRepoAll() functions to traverse a Git
working tree, in the same commit of PR git-lfs#3823 we removed the
FastWalkGitRepo() function and renamed the FastWalkGitRepoAll() function
to FastWalkDir().  We also simplified some of the internal functions
called by the FastWalkDir() function because they no longer needed to
detect or parse ".gitignore" files, or skip directory entries matching
the patterns from those files.

Although the two remaining use cases for the FastWalkDir() function
also did not need the function to detect and skip submodules or
directories named ".git", the logic to do so was retained in the
internal functions of the "tools" package.

Specifically, the fastWalkWithExcludeFiles() function, which is called
by the FastWalkDir() function, establishes two file path filters with
the patterns ".git" and "**/.git", which are then passed down to the
Walk() method of the "fastWalker" structure.  That method checks whether
the current directory entry's name matches either of the filter patterns,
and if it does returns immediately without processing the entry any
further.  In addition, the Walk() method still also performs the check
added in PR git-lfs#3686 to try to avoid traversing into Git submodules.

As noted above, however, neither of the two remaining callers of the
FastWalkDir() function require these checks, because they only need
the function to traverse directory hierarchies which the Git LFS client
has created, and which are specifically not Git working trees.

The cleanupTmp() method of the Filesystem structure in our "fs" package
uses the FastWalkDir() function to find stale files within the local
temporary file storage directory.  The EachObject() method of the same
structure, meanwhile, uses the FastWalkDir() function to invoke a
callback function for each object file in the local object storage
directory.

By default, these local storage directories are the ".git/lfs/tmp" and
".git/lfs/objects" directories.  If the "lfs.storage" configuration
option is set to a relative path, then these directories will be
located somewhere within the ".git" directory, while if the option is
set to an absolute path, our local storage directories will be located
under that arbitrary location.

In all these cases the Git LFS client creates and manages these local
storage directories, so we can expect them not to contain ".git"
directories or submodules.  This is true even when a user has
configured the "lfs.storage" option with an absolute path, since
the client is still responsible for creating and managing "tmp" and
"objects" directories within that arbitrary location.

We therefore now simplify the FastWalkDir() function and the internal
functions it invokes by removing the unnecessary checks for submodules
and for directory entries named ".git".

First, we update the fastWalkWithExcludeFiles() function so that does
not initialize any file path filters, and we rename the function to
fastWalkDir().

Next, we remove the Walk() method's "excludePaths" parameter, and alter
the method so that it no longer skips directory entries based on whether
their names match the patterns in that parameter's file path filters.

We also eliminate the check in the Walk() method for directories
containing ".git" directory entries, as this check's only purpose was
to skip Git submodules within a working tree.

As well, we revise the code comments relating to all of these functions
and methods to reflect their new names and their simplified behaviours.
To minimize the changes in this commit, however, we leave the names of
the functions' parameters and internal variables intact, even though
some of them still reflect the original design and the expectation that
the functions would be used with Git working trees.  In a subsequent
commit in this PR we will then rename these variables and parameters,
along with the "rootDir" field of the "fastWalker" structure, so that
they more accurately represent the functions' current purpose and
implementation.

Finally, note that the "cleans only temp files and directories older
than an hour" test in our "t/t-tempfile.sh" shell test script verifies
the behaviour of cleanupTmp() function, which employs the FastWalkDir()
function, while the TestFastWalkBasic() test function in our Go test
suite directly exercises the fastWalkDir() internal function.  Both
tests thus provide some assurance that our changes in this commit have
not introduced any unexpected regressions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants