Prune feature (delete old / unreferenced local objects) #742
technoweenie merged 60 commits into git-lfs:master
Tidier and frequently used.
This is to mirror what a manual full delete & fetch would do, should not keep previous versions (especially as this means overlapping the date so often you keep at least 1 previous version)
Oh, the intention is to add a
It feels weird having both --verify-remote and --no-verify-remote. I assume it's part of the prune spec? Maybe setting one to true could set the other to false, so that we don't always have to check two things?
This is the standard git pattern for options that can be enabled on the command line, or overridden to be false again e.g. when they're set to true in gitconfig. See --ff vs --no-ff, --squash vs --no-squash and many others. I prefer to feel like git whenever possible.
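A minimal sketch of how such a paired option can be resolved, for illustration only. The function name and its parameters are hypothetical and are not taken from the Git LFS source, and letting the negative flag win when both are given is an assumption of this sketch (Git itself generally honours whichever flag appears last).

```go
package main

import "fmt"

// resolveBoolOption returns the effective value of a boolean option that can
// be enabled in configuration and overridden in either direction on the
// command line. The name and signature are hypothetical.
func resolveBoolOption(configValue, flagOn, flagOff bool) bool {
	switch {
	case flagOff:
		// An explicit --no-<option> wins, even if the config enables it.
		// (Assumption of this sketch: the negative flag takes precedence.)
		return false
	case flagOn:
		// An explicit --<option> enables it regardless of the config.
		return true
	default:
		// No flag given: fall back to the configured value.
		return configValue
	}
}

func main() {
	// Config enables verification, user passes --no-verify-remote.
	fmt.Println(resolveBoolOption(true, false, true))
	// Config leaves it off, user passes --verify-remote.
	fmt.Println(resolveBoolOption(false, true, false))
	// No flag given, so the configured value applies.
	fmt.Println(resolveBoolOption(true, false, false))
}
```

With this shape the caller only consults one resolved value, which addresses the concern about always having to check two things.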
Update nsis to repsect silent mode git-lfs#789
I merged master and had a few conflicts. Though based on the diffs, they were just cases where this PR and master introduced functions in the same place. I think the majority of it came from #801. You can see my changes: https://github.com/github/git-lfs/compare/sinbad-prune
FWIW I'm not super thrilled about
Looks good, I've merged those changes into this PR.
Prune feature (delete old / unreferenced local objects)
In PR git-lfs#742 the "git lfs prune" command was introduced along with accompanying tests, one of which is the "prune keep unpushed" test that checks whether Git LFS objects referenced by not-yet-pushed commits are always retained by "git lfs prune".

In commit 978899e of that PR the initial version of the "prune" tests included some fixture data with commits that were referenced via a tag only, not via a branch ref. However, the test never confirmed that the Git LFS objects in these commits were retained by the "prune" command.

This initial test was then refactored into several tests in commits 03b85e0 and 58dfa23, including the "prune keep unpushed" one, and in the process two lines were left which referred to the fixture data ("oid_keepunpushedtagged1" and "oid_keepunpushedtagged2") but this data was otherwise dropped from the test, and no tag check was implemented.

We therefore re-introduce some fixture data for this test which simulates a tag that points to a commit from a deleted branch, so that the tag is the only reference to this commit and its ancestors. We then ensure that "git lfs prune" retains all the Git LFS objects from these commits, even if they are not recent, when they have not been pushed. Once they are pushed, we then confirm that only the object in the tagged commit is retained (because it is referenced by a recent ref).

We also update a comment in the "prune unreferenced and old" test which refers to a filename in its fixture data that corresponds to how it was originally defined in commit 978899e, but which was subsequently changed (at the same time the comment was added) in commit 03b85e0.
The GetAllWorkTrees() function in our "git" package was introduced as part of the implementation of the "git lfs prune" command in PR git-lfs#742. This function examines the contents of a repository's ".git/worktrees" folder and returns an array of Worktree structures for any linked working trees it finds, plus one Worktree structure for the main working tree, if one exists. The "git lfs prune" command then checks these working trees' current Git references and indexes for any Git LFS objects it should retain in the Git LFS object cache.

In subsequent commits in this PR we expect to add an alternative implementation of the GetAllWorkTrees() function which uses the "git worktree list" command, as this will ensure we remain compatible with some forthcoming changes in Git that will alter the contents of the "gitdir" files in the ".git/worktrees" hierarchy. We will need to retain the legacy implementation of the GetAllWorkTrees() function as well, as prior to Git version 2.36.0 the "git worktree list" command is either not available or does not support the full set of options we will require.

Before introducing our new version of the GetAllWorkTrees() function, we first expand the set of tests and checks we perform against it and the "git lfs prune" command. This will allow us to verify that both implementations return identical data under the same conditions.

In particular, in our existing TestWorkTrees() Go test function we add a linked working tree which references a detached HEAD in the form of a local tag instead of a branch, and confirm that its SHA is also used as its name, and that its type is returned as RefTypeOther. We also make use of the NoError() assertion to check the returned error value from the GetAllWorkTrees() function.

As well, we add a new TestWorktreesBareRepo() Go test function which checks that when GetAllWorkTrees() is run in a bare repository it does not return a Worktree structure at all, as there is no main working tree. To create the bare repository we add a NewBareRepo() wrapper function to our "util" test utility package.

In our t/t-prune-worktree.sh shell test script, we slightly expand the set of checks performed by our "prune worktree" test, just to clarify which objects are pruned by our "git lfs prune" command when a linked working tree has been removed but the "git worktree prune" command has not yet been run.

Finally, we add a "prune worktree (bare main)" test to our test script which confirms that when the "git lfs prune" command is run in a working tree linked to a bare repository, no Git LFS objects are pruned, even if they match no retention configuration settings. This is because the bare main repository has no Git remotes, so all Git LFS objects are treated as never having been pushed to a remote and are therefore all ineligible for pruning.
The GetAllWorkTrees() function in our "git" package was introduced as part of the implementation of the "git lfs prune" command in PR git-lfs#742, along with the Worktree structure and various related functions, such as the pruneTaskGetRetainedWorktree() function run by the "git lfs prune" command. In all instances, we use the terms "Worktree" and "worktree" rather than the camel-case "WorkTree" or "workTree", except in the name of the GetAllWorkTrees() function and its associated TestWorkTrees() test function. Before we add another implementation of the GetAllWorkTrees() function in a subsequent commit in this PR, we first revise the function's name to match that of the structures it returns, which slightly improves the consistency of our naming scheme.
The GetAllWorktrees() function in our "git" package was introduced as part of the implementation of the "git lfs prune" command in PR git-lfs#742. This function examines the contents of a repository's ".git/worktrees" folder and returns an array of Worktree structures for any linked working trees it finds, plus one Worktree structure for the main working tree, if one exists. The "git lfs prune" command then checks these working trees' current Git references and indexes for any Git LFS objects it should retain in the Git LFS object cache. In subsequent commits in this PR we expect to add an alternative implementation of the GetAllWorktrees() function which uses the "git worktree list" command, as this will ensure we remain compatible with some forthcoming changes in Git that will alter the contents of the "gitdir" files in ".git/worktrees" hierarchy. We will need to retain the legacy implementation of the GetAllWorktrees() function as well, as prior to Git version 2.36.0 the "git worktree list" command is either not available or does not support the full set of options we will require. The "git worktree list" command reports when a linked working tree no longer exists and so its data in the ".git/worktrees" hierarchy could be pruned with the "git worktree prune" command. Our new implementation of the GetAllWorktrees() function will therefore be able to return this prunable state as a flag in the Worktree structure. The pruneTaskGetRetainedIndex() function run by our "git lfs prune" command calls the ScanIndex() method of the GitScanner structure in our "lfs" package, whose internal functions then create two DiffIndexScanner structures and invoke their Scan() methods repeatedly. That structure's initialization function, NewDiffIndexScanner(), uses the DiffIndex() function in our "git" package to start a "git diff-index" command and pipe its output into a buffer, which is then consumed by the DiffIndexScanner's Scan() method. 
As we run these "git diff-index" commands in each linked working tree, if it no longer actually exists, the commands simply fail. We silently ignore these failures for two reasons. The DiffIndex() function uses the gitBufferedStdout() function to start the command, and that discards the stderr stream. We also never call the Wait() method of the underlying Cmd structure from the "os/exec" package, so we discard the exit code from the "git diff-index" command as well.

However, we can avoid running unnecessary "git diff-index" commands entirely if we detect that a linked working tree no longer exists. As this will be straightforward in our new implementation of the GetAllWorktrees() function, we first update our existing version of the function to perform a check similar to one made by Git, and return a true flag value in our Worktree structure if the working tree is missing.

We add a Prunable flag to the Worktree structure, and check it in the pruneTaskGetRetainedWorktree() function so we only run the pruneTaskGetRetainedIndex() function if a linked working tree's Prunable flag is false. We then expand our existing TestWorktrees() test function to validate that when a linked working tree is removed, the Prunable flag is set as we expect.

Note that our "prune worktree" test in the t/t-prune-worktree.sh test script already performs checks of our "git lfs prune" command after a linked working tree is removed. These checks pass without the changes in this commit because the "git diff-index" commands we try to run in the working tree simply fail, and we then ignore those failures. With the changes in this commit, the behaviour of the "git lfs prune" command remains the same, but it no longer tries to run "git diff-index" in non-extant working trees.
The GetAllWorktrees() function in our "git" package was introduced as part of the implementation of the "git lfs prune" command in PR git-lfs#742. This function examines the contents of a repository's ".git/worktrees" folder and returns an array of Worktree structures for any linked working trees it finds, plus one Worktree structure for the main working tree, if one exists. The "git lfs prune" command then checks these working trees' current Git references and indexes for any Git LFS objects it should retain in the Git LFS object cache. As noted in previous commits in this PR, to check for Git LFS objects which should be retained, we run the "git diff-index" command in each extant linked working tree. Specifically, the commands are started by the DiffIndex() function in our "git" package, which is run by the NewDiffIndexScanner() function called by the internal functions invoked by the ScanIndex() method of the GitScanner structure in our "lfs" package. That method is run by the pruneTaskGetRetainedIndex() function of our "git lfs prune" command. In order to execute the "git diff-index" command in each linked working tree, we require the full, absolute paths of the working trees. At the moment, our GetAllWorktrees() function expects that it can retrieve these full paths from the "gitdir" files in each working tree's subdirectory under the ".git/worktrees" directory. However, forthcoming changes in Git, likely to be released in Git version 2.48.0, will alter the contents of these "gitdir" files to contain relative paths. As a result, we will no longer be able to rely on them to determine a full path to each linked working tree. In particular, commit git/git@717af91 causes Git to now write relative paths into the "gitdir" files, while still supporting any legacy absolute paths found in existing "gitdir" files. 
We can avoid the need to try to parse these relative paths and determine the correct full path by using the "git worktree list" command to retrieve the list of linked working trees, plus the main working tree, if any. This command returns full paths along with the current Git references associated with each working tree, plus several additional attributes. Prior to Git version 2.36.0, though, the "git worktree list" command either did not exist, lacked support for the "--porcelain" option, or lacked support for the "-z" option, so we continue to use our existing GetAllWorktrees() function's implementation when the available Git version is lower than 2.36.0. We therefore rename the existing version of our GetAllWorktrees() function to getAllWorktreesFromGitDir(), and invoke it at the top of our new version of the GetAllWorktrees() function if the Git version is not 2.36.0 or above. Our new implementation of the GetAllWorktrees() function simply calls the "git worktree list --porcelain -z" command and parses the output, creating the same array of Worktree structures as if we had parsed the contents of the ".git/worktrees" hierarchy directly, except that we can depend on the Git command to resolve any relative paths in "gitdir" files into absolute ones for us. Git also checks if these paths actually exist, and if they do not, returns a "prunable" attribute in its output, so we can appropriately set the Prunable flag that we introduced into our Worktree structure in a prior commit in this PR. Our parsing logic requires that we find at least an initial "worktree" line and a subsequent "HEAD" line in the output from the "git worktree list" command. Assuming those are found and have non-empty values, we can return a Worktree structure containing them. 
However, if we also find a "bare" attribute in a subsequent line, we do not return a Worktree structure for this entry, as we want to ignore bare main repositories in the same manner as our existing GetAllWorktrees() function does.
The GetAllWorktrees() function in our "git" package was introduced as part of the implementation of the "git lfs prune" command in PR git-lfs#742. This function examines the contents of a repository's ".git/worktrees" folder and returns an array of Worktree structures for any linked working trees it finds, plus one Worktree structure for the main working tree, if one exists. The "git lfs prune" command then checks these working trees' current Git references and indexes for any Git LFS objects it should retain in the Git LFS object cache. As noted in previous commits in this PR, to check for Git LFS objects which should be retained, we run the "git diff-index" command in each extant linked working tree. Specifically, the commands are started by the DiffIndex() function in our "git" package, which is run by the NewDiffIndexScanner() function called by the internal functions invoked by the ScanIndex() method of the GitScanner structure in our "lfs" package. That method is run by the pruneTaskGetRetainedIndex() function of our "git lfs prune" command. In order to execute the "git diff-index" command in each linked working tree, we require the full, absolute paths of the working trees. At the moment, our GetAllWorktrees() function expects that it can retrieve these full paths from the "gitdir" files in each working tree's subdirectory under the ".git/worktrees" directory. However, forthcoming changes in Git, likely to be released in Git version 2.48.0, will alter the contents of these "gitdir" files to contain relative paths. As a result, we will no longer be able to rely on them to determine a full path to each linked working tree. In particular, commit git/git@717af91 causes Git to now write relative paths into the "gitdir" files, while still supporting any legacy absolute paths found in existing "gitdir" files. 
We can avoid the need to try to parse these relative paths and determine the correct full path by using the "git worktree list" command to retrieve the list of linked working trees, plus the main working tree, if any. This command returns full paths along with the current Git references associated with each working tree, plus several additional attributes. Prior to Git version 2.36.0, though, the "git worktree list" command either did not exist, lacked support for the "--porcelain" option, or lacked support for the "-z" option, so we continue to use our existing GetAllWorktrees() function's implementation when the available Git version is lower than 2.36.0. We therefore rename the existing version of our GetAllWorktrees() function to getAllWorktreesFromGitDir(), and invoke it at the top of our new version of the GetAllWorktrees() function if the Git version is not 2.36.0 or above. Our new implementation of the GetAllWorktrees() function simply calls the "git worktree list --porcelain -z" command and parses the output, creating the same array of Worktree structures as if we had parsed the contents of the ".git/worktrees" hierarchy directly, except that we can depend on the Git command to resolve any relative paths in "gitdir" files into absolute ones for us. We do need to call the Clean() function of the "path/filepath" package to ensure the directory paths are reformatted to match those returned by the Dir() function of the same package, since that is used by our legacy GetAllWorktrees() function. On Windows in particular the Clean() function will adjust the directory path separators to align with the default for the OS. Our parsing logic requires that we find at least an initial "worktree" line and a subsequent "HEAD" line in the output from the "git worktree list" command. Assuming those are found and have non-empty values, we can return a Worktree structure containing them. 
However, if we also find a "bare" attribute in a subsequent line, we do not return a Worktree structure for this entry, as we want to ignore bare main repositories in the same manner as our existing GetAllWorktrees() function does. Conveniently, the "git worktree list" command checks if the paths it returns actually exist, and if they do not, it includes a "prunable" attribute in its output. If we detect this attribute we then set the Prunable flag that we introduced into our Worktree structure in a prior commit in this PR.
In commit 5e654f2 in PR git-lfs#565 a pair of test assertion functions were added to the forerunner of our current t/testhelpers.sh shell library. These assert_local_object() and refute_local_object() functions check for the presence or absence of a file in the object cache maintained by the Git LFS client in a local repository. To perform these checks, the functions capture the output of the "git lfs env" command and parse the contents of the LocalMediaDir line, which reports the full path to the Git LFS object cache location. To retrieve the path, the functions ignore the first 14 characters of the line, as that corresponds to the length of the LocalMediaDir field name (13 characters) plus one character in order to account for the equals sign which follows the field name. Later PRs have added three other assertion functions that follow the same design. The delete_local_object() function was added in commit 97434fe of PR git-lfs#742 to help test the "git lfs fetch" command's --prune option, the corrupt_local_object() function was added in commit 4b0f50e of PR git-lfs#2082 to help test the detection of corrupted local objects during push operations, and most recently, the assert_remote_object() function was added in commit 9bae8eb of PR git-lfs#5905 to improve our tests of the SSH object transfer protocol for Git LFS. All of these functions retrieve the object cache location by ignoring the first 14 characters from the LocalMediaDir line in the output of the "git lfs env" command. However, the refute_local_object() function contains a hint of an alternative approach to parsing this line's data. A local "regex" variable is defined in the refute_local_object() function, which matches the LocalMediaDir field name and equals sign and captures the subsequent object cache path value. Although this "regex" variable was included when the function was first introduced, it has never been used, and does not appear in any of the other similar functions. 
While reviewing PR git-lfs#5905, larsxschneider suggested an even simpler option than using a regular expression to extract the object cache path from the LocalMediaDir line. Rather than asking the Bash shell to start its parameter expansion at a fixed offset of 14 characters into the string, we can define a pattern which matches the leading LocalMediaDir field name and equals sign and specify that the shell should remove that portion of the string during parameter expansion. See also the discussion in this review comment from PR git-lfs#5905: git-lfs#5905 (comment) In addition to these changes, we can remove the definition of the "regex" variable from the refute_local_object() function, as it remains unused. Co-authored-by: Lars Schneider <larsxschneider@github.com>
In our git-lfs-prune(1) manual page we dedicate one section to an explanation of how the Git LFS client determines whether an object is sufficiently "recent" to be retained and not pruned. In this section we list the Git configuration options that control the expiry and retention periods for objects, including settings such as "lfs.fetchRecentCommitsDays", "lfs.fetchRecentRefsDays", and "lfs.pruneOffsetDays". The list contains two entries, one for the "lfs.pruneOffsetDays" option and one for three of the "lfs.fetchRecent*" options.

When the git-lfs-prune(1) manual page was first added in commit bd72983 of PR git-lfs#742, this list of the Git configuration options was formatted such that each entry began with the option name or names as headers on separate lines, followed by a description of their effects as a separate paragraph.

When we converted our manual pages from the Ronn format to AsciiDoc in commit 0c66dcf of PR git-lfs#5054, the list was formatted as an unordered list where the option name or names and the subsequent description were simply all placed together as a single paragraph.

To help make this section of the git-lfs-prune(1) manual page slightly more readable, we adjust its formatting now to use an AsciiDoc description list rather than an unordered list. By doing so, we can restore the option names as distinct headers, each of which will appear on a separate line when we render the source file into HTML or roff with Asciidoctor. The description text of each entry will also then be rendered as an indented paragraph following the entry's headers.
I think this is finally ready for review now. Please see docs/man/git-lfs-prune.1.ronn for details of what it does and how it determines what to retain and what to delete.

I've created quite a detailed set of tests and think I've covered everything, but please do scrutinise, since this does involve deleting user data!