prune: fix deleting objects referred to by stashes#4209
prune: fix deleting objects referred to by stashes#4209chrisd8088 merged 16 commits intogit-lfs:masterfrom
Conversation
It was impossible to pop a stash with LFS data successfully after running any lfs prune command before this fix, because prune would consider stashed data unreferenced. This fixes git-lfs#4206
|
FYI: I wrote the |
|
Hey, thanks for both the bug report and the proposed patch! I was intending to look at the bug report today, and am delighted to have a potential solution in hand before I even do so! 😄 On a first glance this seems reasonable, but to get the CI jobs to pass I think we'll need to fix the trailing whitespace issue in Thanks again for the patch; that's always appreciated! |
|
Ah yes, sorry I ran the tests individually locally rather than the full CI build, I forgot to check the trailing new line. Due to time zone I won’t be able to fix this until tomorrow but feel free to patch it up yourself if you want / have time. :) |
|
Might need some help with this, I've tried to please the cibuild god but it keeps complaining about whitespace; I've added a couple of newlines but no luck. Its been a few years since I've had to bash my head against it so maybe there's a trick I've forgotten :) |
chrisd8088
left a comment
There was a problem hiding this comment.
Thank you so much for putting this together; I would have had to start from zero and this is so much better than that!
I noted a few things I discovered while poking around; see my inline comments for those ideas and concerns.
In addition, one other change we should include is an update to the man page where it lists the places it checks for "active" refs which should not be pruned.
Oh, and thank you also very much for including a test!! ❤️
lfs/gitscanner_log.go
Outdated
| // First get the SHAs of all stashes | ||
| // git reflog show --format="%H" stash | ||
| reflogArgs := []string{"show", "--format=%H", "stash"} | ||
|
|
||
| cmd, err := git.RefLog(reflogArgs...) | ||
| if err != nil { | ||
| return err | ||
| } | ||
|
|
||
| cmd.Start() | ||
| defer cmd.Wait() | ||
|
|
||
| scanner := bufio.NewScanner(cmd.Stdout) | ||
|
|
||
| for scanner.Scan() { | ||
| err = s.ScanRef(strings.TrimSpace(scanner.Text()), cb) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| } | ||
|
|
||
| return nil |
There was a problem hiding this comment.
After a bit of a deeper look, I had a few notes on this scanStashed() function.
The first thing which puzzled me for a while was that it seemed to succeed in a freshly initialized, empty repo with no stashes, where .git/refs/stash doesn't exist yet. I couldn't get git reflog show stash to work on the command line, but it seemed to be OK when run this way.
But the reason why is that the possible error exit code returned by cmd.Wait() is not being checked. If you add to the end of this function something like:
stderr, _ := ioutil.ReadAll(cmd.Stderr)
err = cmd.Wait()
if err != nil {
return fmt.Errorf("error in git reflog: %v %v", err, string(stderr))
}then it does fail in a repo where there's no refs/stash ref, or if you change "stash" to "foobar" or something. So that's the first issue, that the Git command's exit code is not checked.
Looking at how the git stash list command works, I believe it starts by checking for an extant refs/stash ref, and exiting early if it doesn't exist.
So we probably need something similar here once we're checking the Git command's exit code; otherwise we'll bail with Prune error: error in git reflog: exit status 128 fatal: ambiguous argument 'stash': unknown revision or path not in the working tree. (That's what I get once I added the check above on cmd.Wait()'s return value, plus failures in the t/t-prune.sh and other tests.)
Beyond that, I looking again at how git stash list uses the git log -g refs/stash -- command, I wondered if we could avoid the need for a new git.RefLog() function in git/git.go entirely and just use git.Log()?
There was a problem hiding this comment.
I worked up a tentative first half of an alternate version of this function, which checks for refs/stash first, and then uses git.Log(). The hard part, which I skipped for now, is reading the output; I thought that could borrow from the design of parseScannerLogOutput() but without the need for a direction parameter.
func scanStashed(cb GitScannerFoundPointer, s *GitScanner) error {
if _, err := git.ResolveRef("refs/stash"); err != nil {
return nil
}
logArgs := []string{"-g"}
// Add standard search args to find lfs references
logArgs = append(logArgs, logLfsSearchArgs...)
// Search reflog (with -g) for refs/stash
logArgs = append(logArgs, "refs/stash", "--")
cmd, err := git.Log(logArgs...)
if err != nil {
return err
}There was a problem hiding this comment.
Hmm, personally, I think the original version is more elegant - I know it's not checking the result code, but this is a pattern that already exists in the codebase (see RemoteList, RemoteURLs, RecentBranches). The only reason for this command to fail (and others to not) is that there are no stashes to list.
I didn't use git log because there's no need to "walk" the history - all stashes are singular commits. reflog is more efficient to get the data needed and I thought the RefLog command might be useful later on. But I don't mind if you'd prefer to write it another way.
There was a problem hiding this comment.
Ah, I see the similarity to git.RemoteList() and friends now; I had not thought to look at those, only at the other functions nearby. Well, given that, maybe there's a tidy way to combine the two approaches and get a minimal set of changes?
Here's what seems to be working for me tonight, replacing just the git reflog show with git log -g, i.e., something like:
// git log -g --format="%H" refs/stash --
logArgs := []string{"-g", "--format=%H", "refs/stash", "--"}
cmd, err := git.Log(logArgs...)I can't claim to be an expert on the exact functional differences between those two commands (but I'll ask an expert :-) ... however, if they're effectively the same for this purpose, that would allow for reuse of the git.Log() function and avoid the need to add git.RefLog(). Although I confess this is a very small thing, really; I could go either way (particularly if it turns out there's a specific advantage to the git reflog command)!
Thanks again so much for working on this; it's very much appreciated.
There was a problem hiding this comment.
OK you're right I think -g is basically an alias for reflog and makes log behave very differently. Unfortunately for the same reason -g makes some of the other options in logLfsSearchArgs to git log stop working, such as the -G "oid sha256:" search, so we can't combine everything into one git log and send that to parseScannerLogOutput. So you're still down to executing 2 commands, albeit that both of them are now git log so the RefLog command isn't needed.
There was a problem hiding this comment.
Yeah, I tossed out my attempt to use logLfsSearchArgs; that was a non-starter. All I was thinking of was changing the first couple of lines of this function to use git.Log() and the git log -g ... command format. (And according to the folks I consulted, git log -g is equivalent to git reflog show.
That said, based on the feedback I got, I realized there may be a different, more "interesting" issue yet to solve here, which is what happens when someone uses git stash pop --index and had LFS file changes in both their index and their working tree. I'll write that up in the main comments, though.
git/git.go
Outdated
| func RefLog(args ...string) (*subprocess.BufferedCmd, error) { | ||
| reflogArgs := append([]string{"reflog"}, args...) | ||
| return gitNoLFSBuffered(reflogArgs...) | ||
| } | ||
|
|
There was a problem hiding this comment.
See my other inline comment on scanStashed() for some musings on whether we could dodge the need for this new function by using git.Log() to read the stashed refs instead.
|
What was pointed out to me today is that the stash uses multiple commits to store the state of the working tree and the index separately (thanks, @bk2204!), and our various experiments with I wrote up a test which demonstrates the outstanding issue here, I believe; at least for me, using this PR as it stands, it failed on the second begin_test "prune keep stashed changes in index"
(
set -e
reponame="prune_keep_stashed_index"
setup_remote_repo "remote_$reponame"
clone_repo "remote_$reponame" "clone_$reponame"
git lfs track "*.dat" 2>&1 | tee track.log
grep "Tracking \"\*.dat\"" track.log
# generate content we'll use
content_inrepo="This is the original committed data"
oid_inrepo=$(calc_oid "$content_inrepo")
content_indexstashed="This data will be stashed from the index and should not be deleted"
oid_indexstashed=$(calc_oid "$content_indexstashed")
content_stashed="This data will be stashed and should not be deleted"
oid_stashed=$(calc_oid "$content_stashed")
# We just need one commit of base data, makes it easier to test stash
echo "[
{
\"CommitDate\":\"$(get_date -1d)\",
\"Files\":[
{\"Filename\":\"stashedfile.dat\",\"Size\":${#content_inrepo}, \"Data\":\"$content_inrepo\"}]
}
]" | lfstest-testutils addcommits
# now modify the file, and add it to the index
echo -n "$content_indexstashed" > stashedfile.dat
git add stashedfile.dat
# now modify the file again, and stash it
echo -n "$content_stashed" > stashedfile.dat
git stash
# Prove that the stashed data was stored in LFS (should call clean filter)
assert_local_object "$oid_stashed" "${#content_stashed}"
assert_local_object "$oid_indexstashed" "${#content_indexstashed}"
# Prune data, should NOT delete stashed file or stashed changes to index
git lfs prune
assert_local_object "$oid_stashed" "${#content_stashed}"
assert_local_object "$oid_indexstashed" "${#content_indexstashed}"
# Restore working tree from stash
git stash pop --index
# Reset working tree to index from stash
git checkout .
)
end_testSo I think we should add a test like this, first of all, and also consider how best to walk back through all the SHAs associated with the stash. |
|
Nice catch, I never realised stash could create >1 commit but thinking about it I guess it makes sense. I guess it can only ever create 2. I'll have another crack at it. |
|
Here's one failed attempt that I thought would work: func scanStashed(cb GitScannerFoundPointer, s *GitScanner) error {
// First get the SHAs of all stashes
logArgs := []string{"-g", "--format=%h", "refs/stash", "--"}
cmd, err := git.Log(logArgs...)
if err != nil {
return err
}
cmd.Start()
defer cmd.Wait()
scanner := bufio.NewScanner(cmd.Stdout)
var stashShas []string
for scanner.Scan() {
stashShas = append(stashShas, strings.TrimSpace(scanner.Text()))
}
// Stashes are 2 commits, one for index and one for working copy
// Preserve changes in both separately, in case same file modified twice
logArgs = stashShas // include all stashes
logArgs = append(logArgs, "--not", "--branches", "--tags") // exclude those commits reachable locally otherwise
// Add standard search args to find lfs references
logArgs = append(logArgs, logLfsSearchArgs...)
cmd, err = git.Log(logArgs...)
if err != nil {
return err
}
parseScannerLogOutput(cb, LogDiffAdditions, cmd)
return nil
}While this does list the correct commits, the diff generated by the Looking at the result of - oid sha256:67ae8ef97d615140fcb42b3fa7e300a5ce5960cd8ff9807eef3dcb85aec3cecd
- size 25
-oid sha256:32609ffbe33e263e807f7e3b8ee6ae7d990da65de5c15a51f1b62782a8cc7276
-size 43
++oid sha256:c9d22606d40fec4e99cc7d09fbef109f5301a30e35c177e0d14bafb36894cdc5
++size 64This looks more like a FYI for stash commits that only had the working copy, --oid sha256:67ae8ef97d615140fcb42b3fa7e300a5ce5960cd8ff9807eef3dcb85aec3cecd
--size 25
++oid sha256:c7fe2c1adc4c9fbe8df7d1bd56b85d98e52288ffc6547fcb7a15fab39a4942ed
++size 15So I think the best approach is going to be to use |
|
Sorry I pasted the wrong "failed attempt" function in the previous comment at first, fixed now. |
|
One other thing we should consider (again, based on the tips I got about how Doing something like: echo 123 > foo.dat; git add foo.dat
git commit -m initial
echo 456 > foo.dat
git add foo.dat
echo 789 > foo.dat
echo abc > bar.dat
git stash push -u
git log -g refs/stashshows something like the following, where the three SHAs in the and the merge commit itself has the stashed changes to the tracked files from the working tree. |
|
OK that suggests that we need both commands. To summarise:
The set of OIDs from all of those should be the complete set. |
|
Alternatively if we used the "Merge" line: In my case a70e14d is the parent commit of the stash (which should be ignored), 4cf30f4 is the index contents and 1bad043 is the untracked files. The leaf SHA is then the working copy. If that's a consistent we could ignore the first SHA on the merge line and use the rest plus the leaf as arguments to |
|
Yes, exactly -- I realized I needed to edit my description of which commits were which in #4209 (comment) and then noticed you'd already explained the relevant commits more correctly. :-) |
|
Bad news, I'm not sure exactly what version of git started reporting merge parents for stashes, but it was somewhere later than git 2.7 because during my testing on an older linux setup with older git, I couldn't get |
|
For reference in this git 2.7 the parentage of the commits are still the same, it's just that
|
|
OK this took more a lot more work than I expected, but it's finally working. I had to add a "Show" command and I had to also change the OID pointer diff parsing because it couldn't handle merge commits. I added a test which includes all 3 kinds of stash change at once because there was a bug that only happened when you had all 3 together. Hoping this is all OK now! |
|
OK, wow, that's a lot of work indeed, thank you! I plan to take a look tomorrow but from the tests it looks like this covers a lot of territory and all the use cases. ❤️ |
|
Thanks so much for sticking with this! I just wanted to say I'm still mid-review and have not forgotten this, and will get you more feedback as soon as I can. Thanks so much again! |
chrisd8088
left a comment
There was a problem hiding this comment.
This is definitely heading in the right direction; thank you very much for the full test suite (yay!) and the manual page change, as well as all the implementation work!
I wondered about two things in particular, one of which I've tried in 7559392 and which seemed to work (avoiding the need to parse combined diffs), and the other of which I only speculated on, which is maybe using the same git log -m trick to just get all the necessary data from the SHAs we get from the initial reflog query (the git log -g one).
I will try to get to some experimentation with that latter idea, but may not achieve it for a day or more. My apologies for the slow responses here, and thank you so much again for all the time and effort getting this into shape (not to mention reporting it in the first place!)
| any which are not referenced by at least ONE of the following: | ||
|
|
||
| * the current checkout | ||
| * all existing stashes |
When looking for recent objects to retain during a "git lfs prune", we can parse the output from "git show" on the SHAs from merge commits generated by "git stash" without needing to parse the combined diff format if we pass the -m flag to "git show". This forces Git to generate diffs of the individual parents of the merge; see https://git-scm.com/docs/git-show#_combined_diff_format for details. This allows us to simplify some of the changes from commit 4049928 and keep the scan() and newLogScanner() functions unchanged from before, when they only accepted traditional unified diffs, not combined diffs.
When looking for recent objects to retain during a "git lfs prune", we use "git log -m" to collect diffs of all stash commits, including all the parents of the WIP merge commits from the reflog, where index changes and untracked files are recorded for each stash. This allows us to further simplify some of the changes from commit 4049928 and avoid the need to introduce a git.Show() function to use "git show"; it also lets us iterate just twice, once over the reflog for refs/stash to get the SHAs of the WIP merge commits, and then over all the individual parent diffs output by "git log -m" for those merges.
chrisd8088
left a comment
There was a problem hiding this comment.
I'm going to approve this as it now stands, but also likely ask @bk2204 to take a second look, as I've added some of my own work here.
|
Looks good to me! 👍 |
bk2204
left a comment
There was a problem hiding this comment.
Overall, looks great. I found two minor things that we might want to fix, but I'm pretty happy with the way this turned out.
Per PR advice from bk2204, we switch to using printf instead of echo in the t-prune.go tests of stashed changes. We also fixup one code comment to match the count of processes in commands/command_prune.go.
The "prune keep stashed changes", "prune keep stashed changes in index", and "prune keep stashed untracked files" tests ensure that Git LFS objects referenced by stashes are always retained when the "git lfs prune" command is run. However, they do not at present check that objects belonging to files which match a "lfs.fetchexclude" pattern, which would otherwise be pruned regardless of their status, will be retained because they have been stashed. Logic in the "git lfs prune" command to retain objects from stashed changes was added in PR git-lfs#4209, and it should demonstrate this behaviour, but no checks actually confirm it. We therefore expand these three tests so that their simulated commit histories include objects whose file path matches a path we then exclude using the "lfs.fetchexclude" filter. We then confirm that all these objects are retained by "git lfs prune" when they are stashed, included when they are stashed from the index and when they are untracked files stashed with "git stash -u".
The "prune keep stashed changes", "prune keep stashed changes in index", and "prune keep stashed untracked files" tests ensure that Git LFS objects referenced by stashes are always retained when the "git lfs prune" command is run. However, they do not at present check that objects belonging to files which match a "lfs.fetchexclude" pattern, which would otherwise be pruned regardless of their status, will be retained because they have been stashed. Logic in the "git lfs prune" command to retain objects from stashed changes was added in PR git-lfs#4209, and it should demonstrate this behaviour, but no checks actually confirm it. We therefore expand these three tests so that their simulated commit histories include objects whose file path matches a path we then exclude using the "lfs.fetchexclude" filter. We then confirm that all these objects are retained by "git lfs prune" when they are stashed, included when they are stashed from the index and when they are untracked files stashed with "git stash -u".
The "prune keep stashed changes", "prune keep stashed changes in index", and "prune keep stashed untracked files" tests ensure that Git LFS objects referenced by stashes are always retained when the "git lfs prune" command is run. However, they do not at present check that objects belonging to files which match a "lfs.fetchexclude" pattern, which would otherwise be pruned regardless of their status, will be retained because they have been stashed. Logic in the "git lfs prune" command to retain objects from stashed changes was added in PR git-lfs#4209, and it should demonstrate this behaviour, but no checks actually confirm it. We therefore expand these three tests so that their simulated commit histories include objects whose file path matches a path we then exclude using the "lfs.fetchexclude" filter. We then confirm that all these objects are retained by "git lfs prune" when they are stashed, included when they are stashed from the index and when they are untracked files stashed with "git stash -u".
The "prune keep stashed changes", "prune keep stashed changes in index", and "prune keep stashed untracked files" tests ensure that Git LFS objects referenced by stashes are always retained when the "git lfs prune" command is run. However, they do not at present check that objects belonging to files which match a "lfs.fetchexclude" pattern, which would otherwise be pruned regardless of their status, will be retained because they have been stashed. Logic in the "git lfs prune" command to retain objects from stashed changes was added in PR git-lfs#4209, and it should demonstrate this behaviour, but no checks actually confirm it. We therefore expand these three tests so that their simulated commit histories include objects whose file path matches a path we then exclude using the "lfs.fetchexclude" filter. We then confirm that all these objects are retained by "git lfs prune" when they are stashed, included when they are stashed from the index and when they are untracked files stashed with "git stash -u".
The "prune keep stashed changes", "prune keep stashed changes in index", and "prune keep stashed untracked files" tests ensure that Git LFS objects referenced by stashes are always retained when the "git lfs prune" command is run. However, they do not at present check that objects belonging to files which match a "lfs.fetchexclude" pattern, which would otherwise be pruned regardless of their status, will be retained because they have been stashed. Logic in the "git lfs prune" command to retain objects from stashed changes was added in PR git-lfs#4209, and it should demonstrate this behaviour, but no checks actually confirm it. We therefore expand these three tests so that their simulated commit histories include objects whose file path matches a path we then exclude using the "lfs.fetchexclude" filter. We then confirm that all these objects are retained by "git lfs prune" when they are stashed, included when they are stashed from the index and when they are untracked files stashed with "git stash -u".
In commit 4049928 of PR git-lfs#4209 the scanStashed() function of the "lfs" package was revised to no longer make use of the *GitScanner argument that was part of its original implementation from commit 2dc718b in the same PR. However, the argument was never dropped from the function's signature, so we can remove it now.
In commit 4049928 of PR git-lfs#4209 the scanStashed() function of the "lfs" package was revised to no longer make use of the *GitScanner argument that was part of its original implementation from commit 2dc718b in the same PR. However, the argument was never dropped from the function's signature, so we can remove it now.
It was impossible to pop a stash with LFS data successfully
after running any lfs prune command before this fix, because
prune would consider stashed data unreferenced.
This fixes #4206