Skip to content

Adding parsing methods to extract LFS references from arbitrary history#559

Merged
technoweenie merged 5 commits intogit-lfs:masterfrom
sinbad:scan-log-diff
Aug 7, 2015
Merged

Adding parsing methods to extract LFS references from arbitrary history#559
technoweenie merged 5 commits intogit-lfs:masterfrom
sinbad:scan-log-diff

Conversation

@sinbad
Copy link
Contributor

@sinbad sinbad commented Aug 4, 2015

This is part of my preparatory work to support the prune command; I thought I would submit this part early both to establish its use while I continue working on the specific features, and in case it is useful elsewhere.

The focus of this PR is the parseLogOutputToPointers function which can scan a properly formatted git log output & extract relevant changes to LFS references. I've included an example of its use, ScanUnpushed() which shows what oids haven't been pushed anywhere & therefore only exist locally. However it will also be useful for other tasks (e.g. fetch --recent). The reason for it being able to scan for additions or removals is that depending on which direction in history you're scanning you might want either one (e.g. scanning backwards from a snapshot date between commits you want the '-' diffs to know what was being used beforehand).

The main feature of this method compared to alternatives is that it can extract WrappedPointers from arbitrary history with just a single git log command (and can be run in parallel with other tasks if needed)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but why is this being joined with a global slice when none of the flags change? Couldn't this just use the global slice?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need the global logLfsSearchArgs? I don't see it referenced anywhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In both cases the answer to this is it's there for my future use :) The idea is that the local logArgs represents the history filters for this specific function (unpushed), while logLfsSearchArgs are the common arguments that will be needed for all history scans. I'm going to need the distinction and the shared global args for the other work I'm doing on fetch and they're probably useful elsewhere too. So in other functions I'll use a different set of log filtering args but still the standard logLfsSearchArgs to get the format & oid change filter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note how parseLogOutputToPointers requires that at least logLfsSearchArgs are specified in the git log command that produces the output it consumes, but other params are up to the caller.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems legit 👍

@technoweenie
Copy link
Contributor

Looks good. Will this work with pointers that use extensions?

version https://git-lfs.github.com/spec/v1
ext-0-foo sha256:{original hash}
ext-1-bar sha256:{hash after foo}
oid sha256:{hash after bar}
size 123

The oid will definitely change as the extensions change, so I think it will.

@rubyist
Copy link
Contributor

rubyist commented Aug 4, 2015

This looks cool 👍

@sinbad
Copy link
Contributor Author

sinbad commented Aug 5, 2015

Looks good. Will this work with pointers that use extensions?
The oid will definitely change as the extensions change, so I think it will.

Yes, so long as the oid changes the diff will come out in the log search. However if there's a chance that more than 2 extension lines will occur before the oid line then the -U3 context won't be enough to pick up the whole pointer for decoding. Do we have an idea of the upper bar of extensions? I think I'll just use -U10 to be sure since it won't hurt to be over-cautious.

@technoweenie
Copy link
Contributor

Hmm, I don't recall if we set an upper limit on the number of extensions. I can see setting it to something really low, like 3 or 5. 10 is fine for now until we can settle on a real limit.

@sinbad
Copy link
Contributor Author

sinbad commented Aug 5, 2015

Yeah -U10 allows for 9 ext- lines before & after the oid line (the version & size lines use up the last context, assuming neither change), so this should be plenty; it doesn't make any difference for any pointers without extensions anyway. I'm not currently parsing out ext- lines into the pointer data because the current decode method can't handle them anyway. Even after it does I think they're probably not useful in this context (extensions are just for clean/smudge?).

@sinbad
Copy link
Contributor Author

sinbad commented Aug 5, 2015

Oh, I see extensions have been merged in now. I've updated this PR so ext- data is passed from the diff too.

@technoweenie
Copy link
Contributor

If you want to get the "actual" file from an lfs pointer with extensions, you'll have to parse all that out. This info is only stored in these pointer files. A plain object in .git/lfs/objects does not have the necessary info to smudge itself.

Keep in mind: extensions have not shipped. Now's the time to make changes :)

@sinbad
Copy link
Contributor Author

sinbad commented Aug 5, 2015

Hrm, that's going to make pruning (& maybe other functions) a lot more complicated if you can no longer rely on the oid == the storage location.

Doesn't this also mean that if an extension changed it could generate a different storage location for the same oid? So changing your gitconfig can completely change the storage location of all assets even if they had the same original oid?

It would be a lot simpler for many purposes if there was a 'storage oid' field in the pointer which easily identified where the data is stored, rather than having to locate the highest ext- entry. You could use it in all cases to identify what storage location this represents, even without any exts. I think the extensions PR has already broken the 'fetch unique' logic for example:

        // Only add to download queue if local file is not the right size already
        // This avoids previous case of over-reporting a requirement for files we already have
        // which would only be skipped by PointerSmudgeObject later
        if !lfs.ObjectExistsOfSize(p.Oid, p.Size) {
            q.Add(lfs.NewDownloadable(p))

Also 'Size' is presumably not reliable any more because it represents the original size of the object, not what it is after the extension chain was applied?

@sinbad
Copy link
Contributor Author

sinbad commented Aug 5, 2015

And eek, even a small screw up in your extension code can utterly and completely screw up your storage integrity, and every developer has to get it right. I'd hate to support this ;)

@sinbad
Copy link
Contributor Author

sinbad commented Aug 5, 2015

Above comments might be wrong, extensions proposal said:

The oid and size keys are calculated from the final bytes written into the LFS storage

If it's the case that p.Oid and p.Size are post-extensions then it's less of a problem (integrity of extension setup across all devs still worries me a bit). I read your comment to mean the "actual" storage location would have to be parsed out of the last ext entry. Code is a little confusing so not sure which it is.

@technoweenie
Copy link
Contributor

The OID is still the storage location, and the size matches the byte size of the object in local storage (.git/lfs/objects) or even at the server. Since it's the output of the extension chain, it could be a compressed, encrypted thing.

Now, if you want to turn the blob into the actual file during checkout, you'll either need all the pointer fields from the log, or you'll need to read the pointer from the git blob.

Doesn't this also mean that if an extension changed it could generate a different storage location for the same oid? So changing your gitconfig can completely change the storage location of all assets even if they had the same original oid?

The Git config only covers the extensions for new files. Existing files will be smudged according to the info in the pointer.

And eek, even a small screw up in your extension code can utterly and completely screw up your storage integrity, and every developer has to get it right. I'd hate to support this ;)

Yeaaaah. At least we only have to worry about the extension system working. But I think that we should advertise extensions as an "experimental" feature. I'm afraid people will rush to build compression filters based on a non deterministic algorithm.

@sinbad
Copy link
Contributor Author

sinbad commented Aug 5, 2015

The OID is still the storage location, and the size matches the byte size of the object in local storage

Phew. Thanks for the confirmation.

Now, if you want to turn the blob into the actual file during checkout, you'll either need all the pointer fields from the log, or you'll need to read the pointer from the git blob.

OK, my latest commit on this PR includes all the ext- lines so that will all still work from these results too.

Yeaaaah. At least we only have to worry about the extension system working. But I think that we should advertise extensions as an "experimental" feature. I'm afraid people will rush to build compression filters based on a non deterministic algorithm.

Agreed, all it takes is for the extension to change over time & 2 developers have different versions, or for people to choose a generic name like 'compress' and for that to have been used by different extensions :o Wild thought: maybe the extensions themselves should have an identifying SHA based on how they translate a known input string? Then if they're incompatible you know before you start trampling over real data.

@ryansimmen
Copy link
Contributor

Git LFS Extensions support 10 extensions numbered 0-9.

@ryansimmen
Copy link
Contributor

Wild thought: maybe the extensions themselves should have an identifying SHA based on how they translate a known input string? Then if they're incompatible you know before you start trampling over real data. 👍

@sinbad
Copy link
Contributor Author

sinbad commented Aug 5, 2015

Git LFS Extensions support 10 extensions numbered 0-9.

OK -U12 it is then to be absolutely sure. There's a crazy edge case with -U10 where if 10 extensions were installed, there same content was re-added but the last extension output a different result, 10 lines of context wouldn't quite be enough (there could be 11 unchanged lines including version before the oid). It'll never happen except maybe it might ;)

@technoweenie
Copy link
Contributor

This looks good, but I'd like to see some more tests. For instance, none of the tests have extensions in the diffs.

One other thing: the gitattributes file sets these files up with a custom "lfs" diff handler:

*.gif filter=lfs diff=lfs merge=lfs -crlf

It doesn't do anything yet, but I wonder how it'll affect this function if we add a custom "lfs" diff handler.

@sinbad
Copy link
Contributor Author

sinbad commented Aug 6, 2015

Good call on the diff handler, that should be explicit. I'll deal with that and add a few cases with extension lines.

@sinbad
Copy link
Contributor Author

sinbad commented Aug 7, 2015

Tests added for extension cases. I've tested the use of a custom diff tool in .gitattributes (basically just defining one that called echo) and actually it doesn't affect the output of git log, (or even git show), only that of git diff. So there is no impact on this code.

@technoweenie
Copy link
Contributor

Great, thanks for checking into that.

technoweenie added a commit that referenced this pull request Aug 7, 2015
Adding parsing methods to extract LFS references from arbitrary history
@technoweenie technoweenie merged commit 69f8241 into git-lfs:master Aug 7, 2015
@sinbad sinbad deleted the scan-log-diff branch August 7, 2015 15:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants