Make fsck able to check for invalid pointers#4525
Conversation
chrisd8088
left a comment
There was a problem hiding this comment.
Thank you again for another great addition to the feature set for v3.0! I only had minor questions, mostly just about comments and other fiddly whatnots; nothing very substantive. This new functionality is awesome.
488d66c to
8dbfa72
Compare
|
As with the other PR, I've added some squash and fixup commits on the end and will squash them down once you're happy with the result. |
chrisd8088
left a comment
There was a problem hiding this comment.
This is looking really great, thank you!! All of my remaining comments are in the category of things I'm not certain are necessarily things which need to be changed (plus a few which are just problems with legacy comments or tests that I noticed in passing).
I did wonder in particular about two possible renamings, of the new git/scanner_test.go to git/ls_tree_scanner_test.go, and of runScanTreeForInvalidPointers() and catFileBatchTreeForInvalidPointers() to drop the Invalid adjective, if that's appropriate.
Again, thanks so much for this, and my apologies for the slow and disjointed reviews.
There's definitely nothing to apologize for. I appreciate your thorough and detailed reviews and the care and attention you've paid to my recent series, despite the fact that there's a lot of code to look through. Our codebase and future maintainers (or future we) will be better off for it. |
We want to add support for checking pointer files with git lfs fsck. In order to do so, let's add a flag to the pointer file that indicates whether it's canonical: that is, whether serializing the file in the expected way produces exactly the same data as the input.
Right now, the empty file is its own pointer, but our pointer parser doesn't consider this to be a pointer. We'll fix that in a future change, but for now, that when we parse these files, we don't fail if the object representing the empty SHA-256 value is missing on disk, since we don't need it to be present.
Right now, our code to parse pointers doesn't handle empty files, which are their own pointer. However, in the future, it will learn to do so. When we do this, we don't want to upload or download the object with this OID, so let's special-case this code path to avoid attempting to transfer these objects over the network.
The empty string is its own pointer, so let's parse it as one. Instead of returning an error when the input string is empty, we emit a pointer file with the correct OID and size. The code already handles emitting such a pointer correctly.
Currently, Git LFS tries very hard to ignore Git files that are not pointers. However, we'd like to add a mode to git lfs fsck where people can scan for invalid pointers so they can check for problems in CI. Let's add an error type that we can use to report these errors and extract the relevant data from them to provide to the user.
Using a simple exclamation point in front of a command inverts the sense of the command, but it also disables the error checking of "set -e". As a result, let's use "&& exit 1 || true" since that will cause error checking to be enabled.
Currently we process pointers in a rather lax way and accept a variety of non-canonical pointers, such as those with CRLF line endings or a size value of 0 (which should be an empty file instead). However, we have a flag to indicate whether the pointer is canonical when parsing it, so let's add a --strict flag so that we can allow people to reject non-canonical pointers. Note that while strictness is not the default, we add a --no-strict flag as well so we can allow people to use it now and change the default at a later time (probably 4.0).
In the future, we'll add additional types of fsck in addition to checking objects. Let's move the existing code into a function so we can call several different functions to check other types of things about our repository.
Currently, we only output messages related to checking the objects present on the local system. However, in the future, we'll add several other different checks to git lfs fsck. In order to make that work, let's switch to an output format that can be more easily parsed by commands.
We're going to need to scan trees with ls-tree in the git package in the future, and we can't call into the lfs package because of import loops, so let's move the scanner to the git package. While we're at it, let's make two important changes. First, let's remove the blob size check, since we're going to want this functionality in order to read all blobs, not just small ones. As part of that, move that check into the place where we use the output of the scanner so we don't lose this check. The other check is to change the name Sha1 to Oid, since we now support SHA-256 repos as well as SHA-1 repos. Move the tests and some of the helper functions to the new package as well.
Right now, this function lets us check whether a file is a pointer or not. However, in the future, we'll also want to use this function to find all of the .gitattributes files as well as pointers. Let's make the function take an arbitrary predicate so we can reuse it for both purposes.
At this point, we only read attributes from files on disk. However, in the future, we'll need to handle attribute data being read from the output of git cat-file --batch. In order to do so, we'll need to handle reading from an arbitrary Reader, so let's refactor out that code into a separate method and make the file reading code handle it.
In the future, we'll want to support detecting various problems with pointers. These fall into two types: pointers which are non-canonical and files which should be pointers but are not. Our existing scanning functions are not well suited to this, unfortunately, so we add some additional functions. We first scan all of the commits in the range we want and then, having found their object IDs, call git ls-tree to enumerate each item in its corresponding root tree. We accumulate the patterns in every found .gitattributes file, and we keep track of every other file we process, checking small files for being a pointer. Once we've processed the entire tree, we compute the set of patterns for the .gitattributes file and check each file against it. If the file is a pointer, we emit the pointer to our callback, and if it is not a pointer but matches the patterns, then we emit an error indicating that it should have been a pointer.
Now that we have support for checking invalid pointer files, let's add an additional check for that in git lfs fsck. Emit a message for each pointer file which is non-canonical and each file which should have been a pointer but was not. Add options to control which checks are run, and default to the old behavior for compatibility. While we're updating our code for multiple checks, let's also avoid relying on the fact that our check functions return nil when there are no problems and instead rely on them returning any empty slice, whether nil or non-nil. Document these new options in the manual page as well.
We currently exit successfully on a failed fsck, but we should instead exit 1 so that users can tell that there's been a problem. This also means that users can easily use git lfs fsck in CI to check for invalid pointers in their commits and have CI fail if there are. Add some tests for our new functionality as well. Reorder the existing code to reduce the need for an additional if statement.
Currently, git lfs fsck operates only on the HEAD and index. For objects, this is usually the right decision, since only objects for HEAD are checked out. However, for pointers, this may not be desired. A user may want to check pointers for a range of commits, such as during a CI job. To deal with these cases, let a user specify a revision or a simple range of revisions to operate on, and process those revisions. Note that we don't currently process the index with --pointers because this requires a completely different set of scanners which are not yet implemented. We can implement such a feature in a future revision if desired. In the tests, refactor out our setup code into a function for reuse in multiple tests.
ce1df0b to
e6c9d1d
Compare
We update the comment describing the scanRefsToChan() internal function to match the one for the newer scanRefsToTree() function, which was added in commit 608bc8d in PR git-lfs#4525. The older comment for the scanRefsToChan() function has been out of sync with the function's actual signature and usage since at least commit 3085dc3 in PR git-lfs#1743, when the function was changed to invoke a provided callback instead of return a channel of pointer object wrappers.
We update the comment describing the scanRefsToChan() internal function to match the one for the newer scanRefsToTree() function, which was added in commit 608bc8d in PR git-lfs#4525. The older comment for the scanRefsToChan() function has been out of sync with the function's actual signature and usage since at least commit 3085dc3 in PR git-lfs#1743, when the function was changed to invoke a provided callback instead of return a channel of pointer object wrappers.
Although we have a number of tests of the "git lfs fsck --pointers" option added in PR git-lfs#4525, we do not yet have any of the companion "git lfs fsck --objects" option. We therefore add a pair of tests, one of checking for expected output when Git LFS object files are corrupt or missing, and one checking for no output when no Git LFS objects exist in the repository. Note that at present we expect an exit code of 2, not 1, when the "git lfs fsck --objects" command is unable to move a completely missing Git LFS object file to the .git/lfs/bad directory. We expect to revise this behaviour and the relevant exit code checks in a subsequent commit.
Although we have a number of tests of the "git lfs fsck --pointers" option added in PR git-lfs#4525, we do not yet have any of the companion "git lfs fsck --objects" option. We therefore add a pair of tests, one of checking for expected output when Git LFS object files are corrupt or missing, and one checking for no output when no Git LFS objects exist in the repository. Note that at present we expect an exit code of 2, not 1, when the "git lfs fsck --objects" command is unable to move a completely missing Git LFS object file to the .git/lfs/bad directory. We expect to revise this behaviour and the relevant exit code checks in a subsequent commit.
Although we have a number of tests of the "git lfs fsck --pointers" option added in PR git-lfs#4525, we do not yet have any of the companion "git lfs fsck --objects" option. We therefore add a pair of tests, one of checking for expected output when Git LFS object files are corrupt or missing, and one checking for no output when no Git LFS objects exist in the repository. Note that at present we expect an exit code of 2, not 1, when the "git lfs fsck --objects" command is unable to move a completely missing Git LFS object file to the .git/lfs/bad directory. We expect to revise this behaviour and the relevant exit code checks in a subsequent commit.
In commit f4b8938 of PR git-lfs#3391 the t/t-attributes.sh test script was introduced with its initial "macros" test, which validates that the "git lfs track" command is able to parse macro attribute definitions in the top-level .gitattributes file and resolve references to those macros in the same file. It also confirms that the command does not accept macro definitions in .gitattributes files in subdirectories, as Git does not accept these either. However, Git does resolve macro attribute references from .gitattributes files in subdirectories, so long as they refer to macro attributes defined in the top-level .gitattributes (or one of the other files where definitions are accepted, such as the .git/info/attributes file). But the "git lfs track" command at present does not resolve such references consistently because it sorts the attributes files by path length and then processes them strictly in that order, from longest to shortest. Thus references to macro attributes defined in the top-level .gitattributes file from other attributes files never succeed because the top-level file is always parsed last (except for the global and system attributes files). We therefore add a note to this effect in the "macros" test to explain why we do not test valid macro attribute references in a .gitattributes file in a subdirectory. (There is also an inconsistency in how "git lfs track" handles references to macro attributes defined in the .git/info/attributes file, because if the references appear in .gitattributes files whose full file path in the repository is longer than ".git/info/attributes", then the references are not resolved as these files are parsed before the .git/info/attributes one, whereas references from other .gitattributes files are resolved.) Separately, in commit 608bc8d of PR git-lfs#4525 support for scanning the repository contents using the output of the "git ls-tree" command was added to help enable the "git lfs fsck" to search for invalid Git LFS pointer files. The GitScanner.ScanRefByTree() method invokes a chain of functions, of which catFileBatchTreeForPointers() reads Git blob metadata and examines each blob in turn to see if it is a Git LFS pointer or a .gitattributes file, and if it is the latter it reads and parses its contents, including macro attribute definitions if the file is the top-level .gitattributes file. We therefore add a "fsck detects invalid pointers with macro patterns" test to the t/t-fsck.sh test script which validates the ability of the "git lfs fsck" command to report as invalid pointers any files matching patterns with a "filter=lfs" attribute defined by reference to a macro attribute defined in the top-level .gitattributes file. To do this we refactor the setup_invalid_pointers() helper function so that we can reuse some of its code in a new, smaller function that just creates invalid pointers. However, we also add a note explaining that we can not yet test this behaviour with a .gitattributes file whose parent directory sorts before the top-level .gitattributes one in the output from "git ls-tree". Because that command outputs its results sorted by filepath, a file such as .dir/.gitattributes will be listed before the top-level .gitattributes file, and so any macro attribute references from the .dir/.gitattributes file to macro attributes defined in the top-level .gitattributes file will not be resolved in the way that Git resolves them. For now we defer resolution of this issue and the ones described regarding the "git lfs track" command to the future.
Support for Git macro attributes was added in a series of commits in PR git-lfs#3391, including commit 9d3e52d, where the Line structure of the "git/gitattr" package was updated to include a Macro element which would be set non-nil for macro definition lines in a .gitattributes file file, while the existing Pattern element would be set non-nil for all other lines. The "git lfs track" command, among others, was then adjusted to create a MacroProcessor structure (from the same "git/gitattr" package) and call its ProcessLines() method to resolve any macro references and thus convert the "raw" parsed Line structures into a set for which the Pattern element was always non-nil, and no Macro elements appeared. Later, the "git lfs fsck" command gained the ability to process macro definitions in .gitattributes files, in PR git-lfs#4525. However, the "git lfs migrate import" command was not adjusted, specifically in the implementation of its "--fixup" option, which initializes a Tree structure (also of the "git/gitattr" package) for the root tree of each commit in a repository's history using the package's New() function. This function traverses all the trees in the hierarchy and finds and parses all the .gitattributes files in them. Then, when the command visits each file within the commit's tree using the Rewrite() method of the Rewriter structure in the "git/githistory" package, it calls the (*Tree).Applied() method to match the file's path against any applicable Git attributes, to see if the file should be treated as a Git LFS object. This lack of support for macro attributes in the "git lfs migrate import --fixup" command was then propagated to the "git lfs migrate info --fixup" command in commit 4800c5e of PR git-lfs#4501, when the "git lfs migrate info" command was updated to respect the --fixup option. As a result, both of these commands (when used with the --fixup option) would panic if they encountered a .gitattributes file with any macro definition, as they would call the (*Tree).Applied() method and it would attempt to access the nil Pattern element of the lines with non-nil Macro elements. (Prior to the changes in commit c374d1f of PR git-lfs#5375 the "git lfs migrate import --fixup" command would then stall indefinitely, but it now also exits after the panic condition.) These problems were reported in issue git-lfs#5332. To resolve this problem and avoid similar ones in the future, we refactor the Line structure into a Line interface, which only provides a Attrs() method to retrieve a slice of Attr attributes, and no other methods. We then also define two additional interfaces, each of which embeds the Line interface, PatternLine and MacroLine, with corresponding getter methods for their respective elements. The ParseLine() function of the "git/gitattr" package now returns a slice of generic Line types, each of which is either a PatternLine or a MacroLine, but never both. Callers like the Applied() method of the Tree structure therefore need to perform type assertions or switches to determine which type of Line they are handling, which ensures they always access the line's data through safe methods. We then update the Go tests for the "git/gitattr" package as appropriate, and also add two tests each to the t/t-migrate-fixup.sh and t/t-migrate-import.sh test suites. All four of these new shell tests fail without the changes in this commit. In particular, several of these tests make sure to run the "git lfs migrate" commands outside of any shell pipeline so the test will fail if the command panics and produces no output, even if no output is the expected condition for a successful execution of the command.
Support for Git macro attributes was added in a series of commits in PR git-lfs#3391, including commit 9d3e52d, where the Line structure of the "git/gitattr" package was updated to include a Macro element which would be set non-nil for macro definition lines in a .gitattributes file file, while the existing Pattern element would be set non-nil for all other lines. The "git lfs track" command, among others, was then adjusted to create a MacroProcessor structure (from the same "git/gitattr" package) and call its ProcessLines() method to resolve any macro references and thus convert the "raw" parsed Line structures into a set for which the Pattern element was always non-nil, and no Macro elements appeared. Later, the "git lfs fsck" command gained the ability to process macro definitions in .gitattributes files, in PR git-lfs#4525. However, the "git lfs migrate import" command was not adjusted, specifically in the implementation of its "--fixup" option, which initializes a Tree structure (also of the "git/gitattr" package) for the root tree of each commit in a repository's history using the package's New() function. This function traverses all the trees in the hierarchy and finds and parses all the .gitattributes files in them. Then, when the command visits each file within the commit's tree using the Rewrite() method of the Rewriter structure in the "git/githistory" package, it calls the (*Tree).Applied() method to match the file's path against any applicable Git attributes, to see if the file should be treated as a Git LFS object. This lack of support for macro attributes in the "git lfs migrate import --fixup" command was then propagated to the "git lfs migrate info --fixup" command in commit 4800c5e of PR git-lfs#4501, when the "git lfs migrate info" command was updated to respect the --fixup option. As a result, both of these commands (when used with the --fixup option) panic if they encounter a .gitattributes file with any macro definitions, as they call the (*Tree).Applied() method and it attempts to access the nil Pattern element of the lines with non-nil Macro elements. (Prior to the changes in commit c374d1f of PR git-lfs#5375 the "git lfs migrate import --fixup" command would then stall indefinitely, but it now also exits after the panic condition.) These problems were reported in issue git-lfs#5332. To resolve this problem and avoid similar ones in the future, we refactor the Line structure into a Line interface, which only provides an Attrs() method to retrieve a slice of Attr attributes, and no other methods. We then also define two additional interfaces, each of which embeds the Line interface, PatternLine and MacroLine, with corresponding getter methods for their respective elements. The ParseLine() function of the "git/gitattr" package now returns a slice of generic Line types, each of which is either a PatternLine or a MacroLine, but never both. Callers like the Applied() method of the Tree structure therefore need to perform type assertions or switches to determine which type of Line they are handling, which ensures they always access the line's data through safe methods. We then update the Go tests for the "git/gitattr" package as appropriate, and also add two tests each to the t/t-migrate-fixup.sh and t/t-migrate-import.sh test suites. All four of these new shell tests fail without the changes in this commit. In particular, several of these tests make sure to run the "git lfs migrate" commands outside of any shell pipeline so the test will fail if the command panics and produces no output, even if no output is the expected condition for a successful execution of the command.
Currently,
git lfs fsckchecks for the objects in the repository to have correct hashes and removes any that don't. However, there are also some other problems that can occur in repositories, namely that pointer files can be non-canonical, or files which should be pointers are not. These problems are not yet caught, but this PR introduces code to check for these two cases.In addition, we now exit 1 if there's a failed fsck, in addition to exiting 2 on error and 0 on success.
Fixes #4091
/cc #3865