Description
Objective
Make it easier for people to start using Git LFS when they already have large
objects committed into Git repositories.
Background
One of the common issues that people run into is going along the following path:
- Start with a Git repository that has large objects in the history
- git lfs track *.dat
- Attempt to push to a remote, have the push fail a post-receive hook, and
  stop using Git LFS.
The underlying problem is an expectation that git lfs track will convert
previously existing large Git blobs to LFS objects, which it does not do.
git lfs track only tracks objects within LFS after those changes to the
.gitattributes file have been cached.
What we have recommended thus far is to either run git-filter-branch(1) or
download an external tool with the same purpose (one example of this is BFG:
https://github.com/rtyley/bfg-repo-cleaner).
Solution
I would like to make it possible for Git LFS to do this conversion without
needing to run a complicated git-filter-branch(1) command or use an external
tool. By implementing this in Git LFS ourselves, we have control of the behavior
of the migration tool, which would allow us to implement features like:
- Forward and backward (to and from LFS objects) migrations
- Specific file-path matching
- Fast, LFS-specific object caching
If we could bundle a migration tool to convert large blobs in Git history to LFS
objects within Git LFS itself, we would also be able to:
- Make it easier for users to adopt Git LFS
- Cut down on a class of bug reports where git lfs track is run against large
  objects that already exist
Flags
I'd like the git-lfs-migrate(1) command to have the following options:
- -I, -X (or --include and --exclude): our standard pattern matching
  flags to dictate which objects should and shouldn't be tracked by Git LFS
A non-goal here is --greater-than= or --less-than= size flags, since this
type of pattern matching is not supported in the attributes specification at
https://git-scm.com/docs/gitattributes.
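To make the intended behavior of -I and -X concrete, here is a minimal Go sketch of how a path could be tested against both pattern lists. The Filter type and its Allows method are hypothetical names for illustration, and path.Match is only a rough stand-in for real gitattributes pattern matching, which has more rules than this.

```go
package main

import (
	"fmt"
	"path"
)

// Filter is a hypothetical holder for the -I/--include and -X/--exclude patterns.
type Filter struct {
	Include []string
	Exclude []string
}

// matchAny reports whether name, or its final path element, matches any pattern.
// Assumption: path.Match approximates gitattributes matching closely enough
// for a sketch; malformed patterns are treated as non-matching.
func matchAny(patterns []string, name string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
		if ok, err := path.Match(p, path.Base(name)); err == nil && ok {
			return true
		}
	}
	return false
}

// Allows reports whether the file at name should be migrated: it must not
// match an exclude pattern and, if include patterns exist, must match one.
func (f Filter) Allows(name string) bool {
	if matchAny(f.Exclude, name) {
		return false
	}
	return len(f.Include) == 0 || matchAny(f.Include, name)
}

func main() {
	f := Filter{Include: []string{"*.dat"}, Exclude: []string{"vendor/*"}}
	for _, name := range []string{"data/big.dat", "vendor/big.dat", "README.md"} {
		fmt.Printf("%-16s migrate=%v\n", name, f.Allows(name))
	}
}
```

In this sketch an exclude match always wins over an include match, and an empty include list means "everything not excluded".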
Sub-commands
At first I thought it would be unnecessary to supply a "reverse" (from LFS
object to Git large object) migration, because the migration command would be
able to build a parallel DAG of commit objects and then do a git-update-ref(1)
at the end, producing a git reflog like:
~/example-repo (master) $ git reflog
dc0b3c60 HEAD@{1}: checkout: moving from master-lfs-migrate to master
115a1a1 HEAD@{2}: commit: tools/time_tools: test tools.IsExpiredAtOrIn
...
Meaning that the user could do a git-reset(1) (with the --hard flag) to move
their repository back to its original state.
This approach falls short when dealing with a scenario where the original
repository (with the dangling refs) has been removed. Similarly, it does not
deal with a git gc or git prune --expire=now. In other words, if the old,
large Git objects are gone, there is no way to restore your repository.
That being said, I'd like to have both a:
- git lfs migrate up, and
- git lfs migrate down
to go to and from storing objects in LFS respectively.
Technical Overview
The git-lfs-migrate(1) command should build a parallel commit DAG by walking
the history in topological (--topo-order) ordering and then update the refs all at
once at the end.
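A minimal Go sketch of the "parallel DAG, then move the refs" idea, under a few assumptions: commits arrive parents-first (for example, a reversed topological listing), and the Commit type, the RewriteHistory function, and the rewrite callback are all hypothetical names rather than existing Git LFS APIs. The callback stands in for whatever creates the new commit object and returns its ID.

```go
package main

import "fmt"

// Commit is a hypothetical, minimal view of a commit in the old history.
type Commit struct {
	ID      string
	Parents []string
}

// RewriteHistory walks commits parents-first, builds the parallel history via
// rewrite, and only then computes where every ref should point. Nothing in the
// old history is modified, so the refs can be moved all at once at the end.
func RewriteHistory(
	commits []Commit,
	refs map[string]string,
	rewrite func(c Commit, newParents []string) string,
) (map[string]string, error) {
	oldToNew := make(map[string]string, len(commits))
	for _, c := range commits {
		newParents := make([]string, 0, len(c.Parents))
		for _, p := range c.Parents {
			np, ok := oldToNew[p]
			if !ok {
				return nil, fmt.Errorf("commit %s visited before its parent %s", c.ID, p)
			}
			newParents = append(newParents, np)
		}
		oldToNew[c.ID] = rewrite(c, newParents)
	}

	newRefs := make(map[string]string, len(refs))
	for name, old := range refs {
		rewritten, ok := oldToNew[old]
		if !ok {
			return nil, fmt.Errorf("ref %s points at unknown commit %s", name, old)
		}
		newRefs[name] = rewritten
	}
	return newRefs, nil
}

func main() {
	commits := []Commit{
		{ID: "a"},
		{ID: "b", Parents: []string{"a"}},
		{ID: "c", Parents: []string{"a", "b"}},
	}
	refs := map[string]string{"refs/heads/master": "c"}
	// Placeholder rewrite: a real one would write a tree and a commit object.
	rewrite := func(c Commit, newParents []string) string { return "new-" + c.ID }

	newRefs, err := RewriteHistory(commits, refs, rewrite)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(newRefs["refs/heads/master"])
}
```

Because the old commits are never touched, a failure midway leaves the repository exactly as it was.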
Step-by-step, this would be (for the forward conversion):
- For each ref:
  - For each file in that ref matching the -I and -X flags:
    1. Iteratively calculate the shasum (256) of that file's contents
    2. Generate (and optionally cache) a pointer corresponding to that file
    3. Write the pointer to disk
    4. Call git-update-index(1) to replace that file in the index of the
       current ref.
  - Call git-write-tree(1), attaching the appropriate parents to the current
    "new" commit.
- Call git-update-ref(1) to move the refs from the old history to the new
  one.
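Steps 1 and 2 of the forward conversion could look roughly like the Go sketch below. It streams the file through SHA-256 so the contents never need to be held in memory at once, then renders a pointer in the v1 pointer layout (version, oid, size). The Pointer type and the NewPointer, Encode, and PointerForString names are hypothetical and only for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// Pointer is a hypothetical in-memory form of a Git LFS pointer file.
type Pointer struct {
	Oid  string // hex-encoded SHA-256 of the file contents
	Size int64  // size of the file contents in bytes
}

// NewPointer hashes r incrementally; io.Copy feeds the hash in chunks and
// returns the number of bytes read, which becomes the pointer's size.
func NewPointer(r io.Reader) (Pointer, error) {
	h := sha256.New()
	n, err := io.Copy(h, r)
	if err != nil {
		return Pointer{}, err
	}
	return Pointer{Oid: hex.EncodeToString(h.Sum(nil)), Size: n}, nil
}

// Encode renders the pointer as the text that would replace the large file.
func (p Pointer) Encode() string {
	return fmt.Sprintf(
		"version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %d\n",
		p.Oid, p.Size,
	)
}

// PointerForString is a small convenience wrapper for examples.
func PointerForString(s string) (Pointer, error) {
	return NewPointer(strings.NewReader(s))
}

func main() {
	p, err := PointerForString("abc")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(p.Encode())
}
```

Writing the pointer to disk and updating the index (steps 3 and 4) are left out here, since they depend on how the command ends up talking to Git.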
And for the backward conversion:
- For each ref:
  - For each file in that ref that is parse-able as a pointer and matches the
    -I and -X flag(s):
    1. Fetch an io.Reader handle on the contents of the file in the cache
    2. Write that file to disk
    3. Call git-update-index(1) to replace that file in the index of the
       current ref.
  - Call git-write-tree(1), attaching the appropriate parents to the current
    "new" commit.
- Call git-update-ref(1) to move the refs from the old history to the new
  one.
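For the backward conversion, the "parse-able as a pointer" check might be sketched in Go as below. This is a simplified reading of the pointer layout (one "key value" pair per line, with version, oid, and size keys), not the validation Git LFS itself performs; Pointer, ParsePointer, and ErrNotPointer are hypothetical names.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

const specVersion = "https://git-lfs.github.com/spec/v1"

// Pointer is a hypothetical in-memory form of a Git LFS pointer file.
type Pointer struct {
	Oid  string
	Size int64
}

// ErrNotPointer signals that the blob is ordinary file content.
var ErrNotPointer = errors.New("not a Git LFS pointer")

// ParsePointer decides whether data looks like a pointer and extracts the
// object ID and size needed to find the real contents in the LFS cache.
func ParsePointer(data string) (Pointer, error) {
	fields := map[string]string{}
	for _, line := range strings.Split(strings.TrimSuffix(data, "\n"), "\n") {
		kv := strings.SplitN(line, " ", 2)
		if len(kv) != 2 {
			return Pointer{}, ErrNotPointer
		}
		fields[kv[0]] = kv[1]
	}
	if fields["version"] != specVersion {
		return Pointer{}, ErrNotPointer
	}
	if !strings.HasPrefix(fields["oid"], "sha256:") {
		return Pointer{}, ErrNotPointer
	}
	oid := strings.TrimPrefix(fields["oid"], "sha256:")
	// A SHA-256 digest is 32 bytes, so 64 hex characters.
	if raw, err := hex.DecodeString(oid); err != nil || len(raw) != 32 {
		return Pointer{}, ErrNotPointer
	}
	size, err := strconv.ParseInt(fields["size"], 10, 64)
	if err != nil || size < 0 {
		return Pointer{}, ErrNotPointer
	}
	return Pointer{Oid: oid, Size: size}, nil
}

func main() {
	text := "version " + specVersion + "\n" +
		"oid sha256:ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad\n" +
		"size 3\n"
	p, err := ParsePointer(text)
	fmt.Println(p.Oid, p.Size, err)

	_, err = ParsePointer("just a regular text file\n")
	fmt.Println(err)
}
```

Files that fail this check would simply be left alone by the backward migration.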
Optimizations
I can think of a few ways to optimize the operation of the git-lfs-migrate(1)
command:
- Instead of a full tree-scan:
  - Calculate the diff between each pair of consecutive revisions
  - Replace all pointers from the last revision that were unchanged
  - Update (or create) all pointers for new or changed files that match the
    -I and -X flag(s).
- Instead of reading the entire contents of a file each time, cache it based on
  the OID it has in Git's ODB.
- Parallelize the tree-walk and re-assemble the DAG in chunks, using the same
  diffing and caching strategies as above.
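The OID-keyed caching idea could be sketched in Go as follows. PointerCache and GetOrCompute are hypothetical names; the compute callback stands in for "read the blob and build its pointer". The mutex is there because the cache would be shared if the tree-walk is parallelized, and holding it during compute is a simplification that serializes the work.

```go
package main

import (
	"fmt"
	"sync"
)

// PointerCache maps a Git blob OID to the pointer text generated for it, so a
// blob that appears in many commits is only read and hashed once.
type PointerCache struct {
	mu     sync.Mutex
	byBlob map[string]string
}

// NewPointerCache returns an empty cache ready for use.
func NewPointerCache() *PointerCache {
	return &PointerCache{byBlob: make(map[string]string)}
}

// GetOrCompute returns the cached pointer for blobOID, calling compute only
// on a cache miss. Failed computations are not cached.
func (c *PointerCache) GetOrCompute(blobOID string, compute func() (string, error)) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.byBlob[blobOID]; ok {
		return p, nil
	}
	p, err := compute()
	if err != nil {
		return "", err
	}
	c.byBlob[blobOID] = p
	return p, nil
}

func main() {
	cache := NewPointerCache()
	reads := 0
	compute := func() (string, error) {
		reads++ // stands in for reading the blob out of Git's ODB
		return "pointer-for-blob", nil
	}
	for i := 0; i < 3; i++ {
		if _, err := cache.GetOrCompute("example-blob-oid", compute); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("blob contents read", reads, "time(s)")
}
```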
/cc @git-lfs/core for thoughts and/or concerns
/cc @peff for thoughts on the technical overview and additional insight into possible optimizations