Skip to content

Make fsck able to check for invalid pointers#4525

Merged
bk2204 merged 16 commits intogit-lfs:mainfrom
bk2204:more-powerful-fsck
Jul 19, 2021
Merged

Make fsck able to check for invalid pointers#4525
bk2204 merged 16 commits intogit-lfs:mainfrom
bk2204:more-powerful-fsck

Conversation

@bk2204
Copy link
Member

@bk2204 bk2204 commented Jun 11, 2021

Currently, git lfs fsck checks for the objects in the repository to have correct hashes and removes any that don't. However, there are also some other problems that can occur in repositories, namely that pointer files can be non-canonical, or files which should be pointers are not. These problems are not yet caught, but this PR introduces code to check for these two cases.

In addition, we now exit 1 if there's a failed fsck, in addition to exiting 2 on error and 0 on success.

Fixes #4091
/cc #3865

@bk2204 bk2204 marked this pull request as ready for review June 14, 2021 13:41
@bk2204 bk2204 requested a review from a team June 14, 2021 13:41
Copy link
Member

@chrisd8088 chrisd8088 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you again for another great addition to the feature set for v3.0! I only had minor questions, mostly just about comments and other fiddly whatnots; nothing very substantive. This new functionality is awesome.

@bk2204 bk2204 force-pushed the more-powerful-fsck branch from 488d66c to 8dbfa72 Compare July 13, 2021 17:52
@bk2204
Copy link
Member Author

bk2204 commented Jul 13, 2021

As with the other PR, I've added some squash and fixup commits on the end and will squash them down once you're happy with the result.

Copy link
Member

@chrisd8088 chrisd8088 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really great, thank you!! All of my remaining comments are in the category of things I'm not certain are necessarily things which need to be changed (plus a few which are just problems with legacy comments or tests that I noticed in passing).

I did wonder in particular about two possible renamings, of the new git/scanner_test.go to git/ls_tree_scanner_test.go, and of runScanTreeForInvalidPointers() and catFileBatchTreeForInvalidPointers() to drop the Invalid adjective, if that's appropriate.

Again, thanks so much for this, and my apologies for the slow and disjointed reviews.

@bk2204
Copy link
Member Author

bk2204 commented Jul 14, 2021

Again, thanks so much for this, and my apologies for the slow and disjointed reviews.

There's definitely nothing to apologize for. I appreciate your thorough and detailed reviews and the care and attention you've paid to my recent series, despite the fact that there's a lot of code to look through. Our codebase and future maintainers (or future we) will be better off for it.

bk2204 added 16 commits July 14, 2021 18:12
We want to add support for checking pointer files with git lfs fsck.  In
order to do so, let's add a flag to the pointer file that indicates
whether it's canonical: that is, whether serializing the file in the
expected way produces exactly the same data as the input.
Right now, the empty file is its own pointer, but our pointer parser
doesn't consider this to be a pointer.  We'll fix that in a future
change, but for now, that when we parse these files, we don't fail if
the object representing the empty SHA-256 value is missing on disk,
since we don't need it to be present.
Right now, our code to parse pointers doesn't handle empty files, which
are their own pointer.  However, in the future, it will learn to do so.
When we do this, we don't want to upload or download the object with
this OID, so let's special-case this code path to avoid attempting to
transfer these objects over the network.
The empty string is its own pointer, so let's parse it as one.  Instead
of returning an error when the input string is empty, we emit a pointer
file with the correct OID and size.  The code already handles emitting
such a pointer correctly.
Currently, Git LFS tries very hard to ignore Git files that are not
pointers.  However, we'd like to add a mode to git lfs fsck where people
can scan for invalid pointers so they can check for problems in CI.
Let's add an error type that we can use to report these errors and
extract the relevant data from them to provide to the user.
Using a simple exclamation point in front of a command inverts the sense
of the command, but it also disables the error checking of "set -e".  As
a result, let's use "&& exit 1 || true" since that will cause error
checking to be enabled.
Currently we process pointers in a rather lax way and accept a variety
of non-canonical pointers, such as those with CRLF line endings or a
size value of 0 (which should be an empty file instead).  However, we
have a flag to indicate whether the pointer is canonical when parsing
it, so let's add a --strict flag so that we can allow people to reject
non-canonical pointers.

Note that while strictness is not the default, we add a --no-strict flag
as well so we can allow people to use it now and change the default at a
later time (probably 4.0).
In the future, we'll add additional types of fsck in addition to
checking objects.  Let's move the existing code into a function so we
can call several different functions to check other types of things
about our repository.
Currently, we only output messages related to checking the objects
present on the local system.  However, in the future, we'll add several
other different checks to git lfs fsck.  In order to make that work,
let's switch to an output format that can be more easily parsed by
commands.
We're going to need to scan trees with ls-tree in the git package in the
future, and we can't call into the lfs package because of import loops,
so let's move the scanner to the git package.

While we're at it, let's make two important changes.  First, let's
remove the blob size check, since we're going to want this functionality
in order to read all blobs, not just small ones.  As part of that, move
that check into the place where we use the output of the scanner so we
don't lose this check.

The other check is to change the name Sha1 to Oid, since we now support
SHA-256 repos as well as SHA-1 repos.

Move the tests and some of the helper functions to the new package as
well.
Right now, this function lets us check whether a file is a pointer or
not.  However, in the future, we'll also want to use this function to
find all of the .gitattributes files as well as pointers.  Let's make
the function take an arbitrary predicate so we can reuse it for both
purposes.
At this point, we only read attributes from files on disk.  However, in
the future, we'll need to handle attribute data being read from the
output of git cat-file --batch.

In order to do so, we'll need to handle reading from an arbitrary
Reader, so let's refactor out that code into a separate method and make
the file reading code handle it.
In the future, we'll want to support detecting various problems with
pointers.  These fall into two types: pointers which are non-canonical
and files which should be pointers but are not.

Our existing scanning functions are not well suited to this,
unfortunately, so we add some additional functions.  We first scan all
of the commits in the range we want and then, having found their object
IDs, call git ls-tree to enumerate each item in its corresponding root
tree.  We accumulate the patterns in every found .gitattributes file,
and we keep track of every other file we process, checking small files
for being a pointer.

Once we've processed the entire tree, we compute the set of patterns for
the .gitattributes file and check each file against it.  If the file is
a pointer, we emit the pointer to our callback, and if it is not a
pointer but matches the patterns, then we emit an error indicating that
it should have been a pointer.
Now that we have support for checking invalid pointer files, let's add
an additional check for that in git lfs fsck.  Emit a message for each
pointer file which is non-canonical and each file which should have been
a pointer but was not.

Add options to control which checks are run, and default to the old
behavior for compatibility.  While we're updating our code for multiple
checks, let's also avoid relying on the fact that our check functions
return nil when there are no problems and instead rely on them returning
any empty slice, whether nil or non-nil.

Document these new options in the manual page as well.
We currently exit successfully on a failed fsck, but we should instead
exit 1 so that users can tell that there's been a problem.  This also
means that users can easily use git lfs fsck in CI to check for invalid
pointers in their commits and have CI fail if there are.

Add some tests for our new functionality as well.

Reorder the existing code to reduce the need for an additional if
statement.
Currently, git lfs fsck operates only on the HEAD and index.  For
objects, this is usually the right decision, since only objects for HEAD
are checked out.

However, for pointers, this may not be desired.  A user may want to
check pointers for a range of commits, such as during a CI job.  To deal
with these cases, let a user specify a revision or a simple range of
revisions to operate on, and process those revisions.

Note that we don't currently process the index with --pointers because
this requires a completely different set of scanners which are not yet
implemented.  We can implement such a feature in a future revision if
desired.

In the tests, refactor out our setup code into a function for reuse in
multiple tests.
@bk2204 bk2204 force-pushed the more-powerful-fsck branch from ce1df0b to e6c9d1d Compare July 14, 2021 18:24
@bk2204 bk2204 merged commit 05f7e0c into git-lfs:main Jul 19, 2021
@bk2204 bk2204 deleted the more-powerful-fsck branch July 19, 2021 16:16
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Jul 21, 2021
We update the comment describing the scanRefsToChan() internal
function to match the one for the newer scanRefsToTree() function,
which was added in commit 608bc8d
in PR git-lfs#4525.  The older comment for the scanRefsToChan() function
has been out of sync with the function's actual signature and
usage since at least commit 3085dc3
in PR git-lfs#1743, when the function was changed to invoke a provided
callback instead of return a channel of pointer object wrappers.
pcal43 pushed a commit to pcal43/git-lfs-hack that referenced this pull request Jul 22, 2021
We update the comment describing the scanRefsToChan() internal
function to match the one for the newer scanRefsToTree() function,
which was added in commit 608bc8d
in PR git-lfs#4525.  The older comment for the scanRefsToChan() function
has been out of sync with the function's actual signature and
usage since at least commit 3085dc3
in PR git-lfs#1743, when the function was changed to invoke a provided
callback instead of return a channel of pointer object wrappers.
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Feb 6, 2022
Although we have a number of tests of the "git lfs fsck --pointers"
option added in PR git-lfs#4525, we do not yet have any of the companion
"git lfs fsck --objects" option.  We therefore add a pair of tests,
one of checking for expected output when Git LFS object files are
corrupt or missing, and one checking for no output when no Git LFS
objects exist in the repository.

Note that at present we expect an exit code of 2, not 1, when the
"git lfs fsck --objects" command is unable to move a completely
missing Git LFS object file to the .git/lfs/bad directory.  We
expect to revise this behaviour and the relevant exit code checks
in a subsequent commit.
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Feb 7, 2022
Although we have a number of tests of the "git lfs fsck --pointers"
option added in PR git-lfs#4525, we do not yet have any of the companion
"git lfs fsck --objects" option.  We therefore add a pair of tests,
one of checking for expected output when Git LFS object files are
corrupt or missing, and one checking for no output when no Git LFS
objects exist in the repository.

Note that at present we expect an exit code of 2, not 1, when the
"git lfs fsck --objects" command is unable to move a completely
missing Git LFS object file to the .git/lfs/bad directory.  We
expect to revise this behaviour and the relevant exit code checks
in a subsequent commit.
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Feb 7, 2022
Although we have a number of tests of the "git lfs fsck --pointers"
option added in PR git-lfs#4525, we do not yet have any of the companion
"git lfs fsck --objects" option.  We therefore add a pair of tests,
one of checking for expected output when Git LFS object files are
corrupt or missing, and one checking for no output when no Git LFS
objects exist in the repository.

Note that at present we expect an exit code of 2, not 1, when the
"git lfs fsck --objects" command is unable to move a completely
missing Git LFS object file to the .git/lfs/bad directory.  We
expect to revise this behaviour and the relevant exit code checks
in a subsequent commit.
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Nov 8, 2022
In commit f4b8938 of PR git-lfs#3391 the
t/t-attributes.sh test script was introduced with its initial "macros"
test, which validates that the "git lfs track" command is able to parse
macro attribute definitions in the top-level .gitattributes file and
resolve references to those macros in the same file.  It also confirms
that the command does not accept macro definitions in .gitattributes
files in subdirectories, as Git does not accept these either.

However, Git does resolve macro attribute references from .gitattributes
files in subdirectories, so long as they refer to macro attributes
defined in the top-level .gitattributes (or one of the other files where
definitions are accepted, such as the .git/info/attributes file).  But
the "git lfs track" command at present does not resolve such references
consistently because it sorts the attributes files by path length and
then processes them strictly in that order, from longest to shortest.
Thus references to macro attributes defined in the top-level .gitattributes
file from other attributes files never succeed because the top-level file
is always parsed last (except for the global and system attributes files).

We therefore add a note to this effect in the "macros" test to explain
why we do not test valid macro attribute references in a .gitattributes
file in a subdirectory.

(There is also an inconsistency in how "git lfs track" handles references
to macro attributes defined in the .git/info/attributes file, because if
the references appear in .gitattributes files whose full file path in the
repository is longer than ".git/info/attributes", then the references are
not resolved as these files are parsed before the .git/info/attributes one,
whereas references from other .gitattributes files are resolved.)

Separately, in commit 608bc8d of PR git-lfs#4525
support for scanning the repository contents using the output of the
"git ls-tree" command was added to help enable the "git lfs fsck" to
search for invalid Git LFS pointer files.  The GitScanner.ScanRefByTree()
method invokes a chain of functions, of which catFileBatchTreeForPointers()
reads Git blob metadata and examines each blob in turn to see if it is
a Git LFS pointer or a .gitattributes file, and if it is the latter it
reads and parses its contents, including macro attribute definitions if
the file is the top-level .gitattributes file.

We therefore add a "fsck detects invalid pointers with macro patterns"
test to the t/t-fsck.sh test script which validates the ability of the
"git lfs fsck" command to report as invalid pointers any files matching
patterns with a "filter=lfs" attribute defined by reference to a macro
attribute defined in the top-level .gitattributes file.

To do this we refactor the setup_invalid_pointers() helper function so
that we can reuse some of its code in a new, smaller function that just
creates invalid pointers.

However, we also add a note explaining that we can not yet test this
behaviour with a .gitattributes file whose parent directory sorts
before the top-level .gitattributes one in the output from "git ls-tree".
Because that command outputs its results sorted by filepath, a file such
as .dir/.gitattributes will be listed before the top-level .gitattributes
file, and so any macro attribute references from the .dir/.gitattributes
file to macro attributes defined in the top-level .gitattributes file
will not be resolved in the way that Git resolves them.

For now we defer resolution of this issue and the ones described regarding
the "git lfs track" command to the future.
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Jun 1, 2023
Support for Git macro attributes was added in a series of commits
in PR git-lfs#3391, including commit 9d3e52d,
where the Line structure of the "git/gitattr" package was updated
to include a Macro element which would be set non-nil for macro
definition lines in a .gitattributes file file, while the existing
Pattern element would be set non-nil for all other lines.

The "git lfs track" command, among others, was then adjusted to
create a MacroProcessor structure (from the same "git/gitattr" package)
and call its ProcessLines() method to resolve any macro references
and thus convert the "raw" parsed Line structures into a set for
which the Pattern element was always non-nil, and no Macro elements
appeared.

Later, the "git lfs fsck" command gained the ability to process
macro definitions in .gitattributes files, in PR git-lfs#4525.

However, the "git lfs migrate import" command was not adjusted,
specifically in the implementation of its "--fixup" option, which
initializes a Tree structure (also of the "git/gitattr" package) for
the root tree of each commit in a repository's history using the
package's New() function.  This function traverses all the trees
in the hierarchy and finds and parses all the .gitattributes files
in them.  Then, when the command visits each file within the
commit's tree using the Rewrite() method of the Rewriter structure
in the "git/githistory" package, it calls the (*Tree).Applied()
method to match the file's path against any applicable Git attributes,
to see if the file should be treated as a Git LFS object.

This lack of support for macro attributes in the "git lfs migrate
import --fixup" command was then propagated to the "git lfs migrate
info --fixup" command in commit 4800c5e
of PR git-lfs#4501, when the "git lfs migrate info" command was updated to
respect the --fixup option.

As a result, both of these commands (when used with the --fixup option)
would panic if they encountered a .gitattributes file with any macro
definition, as they would call the (*Tree).Applied() method and it
would attempt to access the nil Pattern element of the lines with
non-nil Macro elements.  (Prior to the changes in commit
c374d1f of PR git-lfs#5375 the "git lfs migrate
import --fixup" command would then stall indefinitely, but it now
also exits after the panic condition.)  These problems were reported
in issue git-lfs#5332.

To resolve this problem and avoid similar ones in the future, we
refactor the Line structure into a Line interface, which only provides
a Attrs() method to retrieve a slice of Attr attributes, and no other
methods.  We then also define two additional interfaces, each of which
embeds the Line interface, PatternLine and MacroLine, with corresponding
getter methods for their respective elements.

The ParseLine() function of the "git/gitattr" package now returns a
slice of generic Line types, each of which is either a PatternLine
or a MacroLine, but never both.  Callers like the Applied() method of
the Tree structure therefore need to perform type assertions or
switches to determine which type of Line they are handling, which
ensures they always access the line's data through safe methods.

We then update the Go tests for the "git/gitattr" package as
appropriate, and also add two tests each to the t/t-migrate-fixup.sh
and t/t-migrate-import.sh test suites.  All four of these new
shell tests fail without the changes in this commit.  In particular,
several of these tests make sure to run the "git lfs migrate" commands
outside of any shell pipeline so the test will fail if the command
panics and produces no output, even if no output is the expected
condition for a successful execution of the command.
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Jun 1, 2023
Support for Git macro attributes was added in a series of commits
in PR git-lfs#3391, including commit 9d3e52d,
where the Line structure of the "git/gitattr" package was updated
to include a Macro element which would be set non-nil for macro
definition lines in a .gitattributes file file, while the existing
Pattern element would be set non-nil for all other lines.

The "git lfs track" command, among others, was then adjusted to
create a MacroProcessor structure (from the same "git/gitattr" package)
and call its ProcessLines() method to resolve any macro references
and thus convert the "raw" parsed Line structures into a set for
which the Pattern element was always non-nil, and no Macro elements
appeared.

Later, the "git lfs fsck" command gained the ability to process
macro definitions in .gitattributes files, in PR git-lfs#4525.

However, the "git lfs migrate import" command was not adjusted,
specifically in the implementation of its "--fixup" option, which
initializes a Tree structure (also of the "git/gitattr" package) for
the root tree of each commit in a repository's history using the
package's New() function.  This function traverses all the trees
in the hierarchy and finds and parses all the .gitattributes files
in them.  Then, when the command visits each file within the
commit's tree using the Rewrite() method of the Rewriter structure
in the "git/githistory" package, it calls the (*Tree).Applied()
method to match the file's path against any applicable Git attributes,
to see if the file should be treated as a Git LFS object.

This lack of support for macro attributes in the "git lfs migrate
import --fixup" command was then propagated to the "git lfs migrate
info --fixup" command in commit 4800c5e
of PR git-lfs#4501, when the "git lfs migrate info" command was updated to
respect the --fixup option.

As a result, both of these commands (when used with the --fixup option)
panic if they encounter a .gitattributes file with any macro
definitions, as they call the (*Tree).Applied() method and it
attempts to access the nil Pattern element of the lines with
non-nil Macro elements.  (Prior to the changes in commit
c374d1f of PR git-lfs#5375 the "git lfs migrate
import --fixup" command would then stall indefinitely, but it now
also exits after the panic condition.)  These problems were reported
in issue git-lfs#5332.

To resolve this problem and avoid similar ones in the future, we
refactor the Line structure into a Line interface, which only provides
an Attrs() method to retrieve a slice of Attr attributes, and no other
methods.  We then also define two additional interfaces, each of which
embeds the Line interface, PatternLine and MacroLine, with corresponding
getter methods for their respective elements.

The ParseLine() function of the "git/gitattr" package now returns a
slice of generic Line types, each of which is either a PatternLine
or a MacroLine, but never both.  Callers like the Applied() method of
the Tree structure therefore need to perform type assertions or
switches to determine which type of Line they are handling, which
ensures they always access the line's data through safe methods.

We then update the Go tests for the "git/gitattr" package as
appropriate, and also add two tests each to the t/t-migrate-fixup.sh
and t/t-migrate-import.sh test suites.  All four of these new
shell tests fail without the changes in this commit.  In particular,
several of these tests make sure to run the "git lfs migrate" commands
outside of any shell pipeline so the test will fail if the command
panics and produces no output, even if no output is the expected
condition for a successful execution of the command.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add runnable check for invalid files (blob instead of pointers)

2 participants