Skip to content

Fast directory walk#1616

Merged
technoweenie merged 24 commits intomasterfrom
fast-directory-walk
Nov 8, 2016
Merged

Fast directory walk#1616
technoweenie merged 24 commits intomasterfrom
fast-directory-walk

Conversation

@sinbad
Copy link
Contributor

@sinbad sinbad commented Nov 1, 2016

Fixes #1611

Provides a number of more optimal replacements for filepath.Walk. The easiest of which is tools.FastWalkGitRepo which automatically avoids walking .git and anything that matches any .gitignore it discovers as it walks.

Lower-level versions are available to include/exclude custom patterns.

I've used this in git lfs track to significantly speed up the enumeration of .gitattributes files, particularly when a repo contains a lot of files, most of which are in the .gitignore, as in #1611.

I've also refactored FilenamePassesIncludeExcludeFilter into the tools package, and extracted a utility function FileMatch which replaces filepath.Match by mirroring the behaviour of gitignore pattern matching, which wasn't completely supported before. This has a side effect of enhancing the capabilities of the include/exclude filters used elsewhere to support a more complete range of wildcard options including the "**" folder match.

@sinbad sinbad added the review label Nov 1, 2016
@ttaylorr
Copy link
Contributor

ttaylorr commented Nov 2, 2016

Looks like the clone (with .lfsconfig) test is at least flaky on Windows, per this run. Going to give this a full review now 👍

Copy link
Contributor

@ttaylorr ttaylorr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, few comments. I think the underlying implementation is correct, but the tests should verify that for me 👍

filepath.Walk(config.LocalWorkingDir, func(path string, info os.FileInfo, err error) error {
if err != nil {
return err
fchan, errchan := tools.FastWalkGitRepo(config.LocalWorkingDir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. What do you think about pulling this out to a private helper function in the commands package? I know the logic is duplicated in a few places across this PR, but I'm not so much thinking about that as I want the findAttributeFiles func to read cleanly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only actually used once in production code, the other place is in the tests. Since what this findAttributeFiles function is doing is only hooking out items from the channel that are .gitattributes files it seems to me to make sense like it is, processing the channel and skipping anything else. I certainly don't want another function that builds a full list and then needs to be iterated, the other option is a function which takes search criteria but that's pretty specific and nothing else is using it yet. Maybe if something else needs it later it could be refactored but wrapping it speculatively right now might not actually be re-usable anyway.

}

if len(includePaths) > 0 {
matched := false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could get rid of the above if by changing this line to:

matched := len(includePaths) == 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that conditional guards a lot more than just setting matched := false, in fact matched is deliberate scoped inside that block. It's that way because early-out by lack of inclusion requires checking every item and only concluding at the end, as opposed to early-out by exclusion which can return as soon as it's matched. That's why the exclusion block is simpler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right; nice catch.

break
}
}
if !matched {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or if we keep the if on L139, I don't think we need this one, since we break out whenever matched is true on L144.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand your logic, sorry. You'll need to write me the complete block as you see it because to me that doesn't seem to equate to the same result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, I think we can keep this as-is in the interest of merging this soon.

@sinbad
Copy link
Contributor Author

sinbad commented Nov 2, 2016

Looks like the clone (with .lfsconfig) test is at least flaky on Windows, per this run. Going to give this a full review now 👍

Yeah and yet the very same commit worked fine in the other AppVeyor run. Seems dubious to me, I can't reproduce it.

@sinbad
Copy link
Contributor Author

sinbad commented Nov 7, 2016

Re-tested everything on Windows this morning, totally fine. Looks like it's passing on AppVeyor now too, did someone re-run it? If so thanks, I don't have access to do that and was going to push a pointless change just to force it 😄

Appreciate if someone could approve this, I don't think any changes are needed. @ttaylorr @technoweenie

Copy link
Contributor

@technoweenie technoweenie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. I'd just prefer not to export anything in tools that isn't necessary.

// all be of the same type. includePaths/excludePaths can be nil.
func FastWalk(dir string, includePaths, excludePaths []string) (<-chan FastWalkInfo, <-chan error) {
return FastWalkWithExcludeFiles(dir, "", includePaths, excludePaths)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this func? Looks like it's only used in tests, while FastWalkGitRepo() is used in code.

@sinbad
Copy link
Contributor Author

sinbad commented Nov 7, 2016

Since they’re all related variations of the same functionality it seemed sensible to expose them all in case it was useful to re-use them, but can be hidden if you prefer.

On 7 Nov 2016, at 18:55, risk danger olson notifications@github.com wrote:

@technoweenie approved this pull request.

This looks good. I'd just prefer not to export anything in tools that isn't necessary.

In tools/filetools.go:

  • Info os.FileInfo
    +}

+// FastWalk is a more optimal implementation of filepath.Walk
+// It differs in the following ways:
+// * Provides a channel of information instead of using a callback func
+// * Uses goroutines to parallelise large dirs and descent into subdirs
+// * Does not provide sorted output; parents will always be before children but
+// there are no other guarantees. Use parentDir in the FastWalkInfo struct to
+// determine absolute path rather than tracking it yourself like filepath.Walk
+// * Supports include / exclude filters
+// Both dir and include/exclude paths can be relative or absolute, but they must
+// all be of the same type. includePaths/excludePaths can be nil.
+func FastWalk(dir string, includePaths, excludePaths []string) (<-chan FastWalkInfo, <-chan error) {

  • return FastWalkWithExcludeFiles(dir, "", includePaths, excludePaths)
    +}
    Do we need this func? Looks like it's only used in tests, while FastWalkGitRepo() is used in code.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or mute the thread.

Copy link
Contributor

@ttaylorr ttaylorr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking awesome. Few minor nits here and there, otherwise I think this is ready to merge.


// Main recursive implementation of fast walk
// Increment waitg.Add(1) for each new goroutine launched internally
func fastWalkItem(parentDir string, itemFi os.FileInfo, excludeFilename string,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method's name is confusing, mainly because it does not specify what an item is in context. Is an item a directory? A file? Something else?

If you could specify what that refers to specifically, and improve the documentation to point out what some of the parameters mean (in particular, I'm thinking of itemFi and fiChan) that would be 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is an item a directory? A file? Something else?

It's both, that's why I used Item and not File or Dir. Guess it could be fastWalkFileOrDir

return
}
defer df.Close()
jobSize := 100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is effectively a constant. Would you mind extracting it and giving it a name outside of this method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why bother, unless something else needs to use the same size?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured that might be a good place to document why the number is what it is and what it's used for. I suppose that can be done inline, but that's just my preference 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't want to pollute the namespace, it's only a named var at all to self-document what it's for.

for children, err := df.Readdir(jobSize); err == nil; children, err = df.Readdir(jobSize) {
// Parallelise all dirs, and chop large dirs into batches
waitg.Add(1)
go func(subitems []os.FileInfo) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice and clean. Love it!

// If any changes are made a copy of the array is taken so the original is not
// modified
func loadExcludeFilename(filename, parentDir string, excludePaths []string) ([]string, error) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: please remove this newline.

defer f.Close()

retPaths := excludePaths
modified := false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these variables that hold the zero-value of their type, I prefer the:

var modified bool

declaration, but that's just a stylistic preference, which you should feel free to ignore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the way it is, lol. If this is later changed to require a default value of true, then the diff of that update would just be:

-modified := false
+modified := true

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I prefer explicitness of defaults, but maybe just because I come from languages that don't guarantee initialised variables

@technoweenie
Copy link
Contributor

I think FastWalk can be removed, and FastWalkWithExcludeFiles should not be exported.

@sinbad
Copy link
Contributor Author

sinbad commented Nov 8, 2016

All done - CLA check seems to be stalled though?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants