proposal: migrate sub-command #2146

@ttaylorr

Objective

Make it easier for people to start using Git LFS when they already have large
objects committed into Git repositories.

Background

One of the common issues that people run into is going along the following path:

  1. Start with a Git repository that has large objects in the history
  2. git lfs track *.dat
  3. Attempt to push to a remote, have the push rejected by a pre-receive
    hook, and stop using Git LFS.

The underlying problem is an expectation that git lfs track will convert
previously existing large Git blobs to LFS objects, which it does not do.
git lfs track only affects files that are added or modified after the pattern
has been written to the .gitattributes file.

What we have recommended thus far is to either run git-filter-branch(1) or
download an external tool with the same purpose (one example of this is BFG:
https://github.com/rtyley/bfg-repo-cleaner).

Solution

I would like to make it possible for Git LFS to do this conversion without
needing to run a complicated git-filter-branch(1) command or use an external
tool. By implementing this in Git LFS ourselves, we have control of the behavior
of the migration tool, which would allow us to implement features like:

  1. Forward and backward (to and from LFS objects) migrations
  2. Specific file-path matching
  3. Fast, LFS-specific object caching

If we could bundle a migration tool to convert large blobs in Git history to LFS
objects within Git LFS itself, we would also be able to:

  1. Make it easier for users to adopt Git LFS
  2. Cut down on a class of bug reports where git lfs track is run against large
    objects that already exist

Flags

I'd like the git-lfs-migrate(1) command to have the following options:

  1. -I, -X (or --include and --exclude): our standard pattern matching
    flags to dictate which objects should and shouldn't be tracked by Git LFS

A non-goal here is --greater-than= or --less-than= size flags, since this
type of matching is not supported in the attributes specification at
https://git-scm.com/docs/gitattributes.
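
To make the intended -I/-X semantics concrete, here is a minimal sketch of how include and exclude patterns might select paths. The helper names (matchesAny, shouldMigrate) are hypothetical, and the matching uses Go's path.Match as a stand-in; it is an assumption for illustration only and is simpler than real .gitattributes pattern matching.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchesAny reports whether name matches any of the glob patterns.
// Assumption: a pattern without a slash is compared against the base name,
// loosely imitating how "*.dat" in .gitattributes applies at any depth.
func matchesAny(patterns []string, name string) bool {
	for _, p := range patterns {
		target := name
		if !strings.Contains(p, "/") {
			target = path.Base(name)
		}
		if ok, err := path.Match(p, target); err == nil && ok {
			return true
		}
	}
	return false
}

// shouldMigrate is a hypothetical helper: a path is selected when it matches
// at least one include pattern (or none were given) and no exclude pattern.
func shouldMigrate(include, exclude []string, name string) bool {
	if len(include) > 0 && !matchesAny(include, name) {
		return false
	}
	return !matchesAny(exclude, name)
}

func main() {
	include := []string{"*.dat", "assets/*.bin"}
	exclude := []string{"testdata/*"}
	names := []string{"a.dat", "src/deep/b.dat", "assets/c.bin", "testdata/d.dat", "README.md"}
	for _, name := range names {
		fmt.Printf("%-16s migrate=%v\n", name, shouldMigrate(include, exclude, name))
	}
}
```

In this sketch an exclude match always wins over an include match, which is one possible reading of "should and shouldn't be tracked".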

Sub-commands

At first I thought it would be unnecessary to supply a "reverse" (from LFS
object to Git large object) migration, because the migration command would be
able to build a parallel DAG of commit objects and then do a git-update-ref(1)
at the end, producing a git reflog like:

~/example-repo (master) $ git reflog
dc0b3c60 HEAD@{1}: checkout: moving from master-lfs-migrate to master
115a1a1 HEAD@{2}: commit: tools/time_tools: test tools.IsExpiredAtOrIn
...

Meaning that the user could do a git-reset(1) (with the --hard flag) to move
their repository back to its original state.

This approach falls short when the original repository (with the dangling
refs) has been removed. Similarly, it does not survive a git gc or
git prune --expire=now. In other words, if the old, large Git objects are
gone, there is no way to restore your repository.

That being said, I'd like to have both:

  1. git lfs migrate up, and
  2. git lfs migrate down

to go to and from storing objects in LFS, respectively.

Technical Overview

The git-lfs-migrate(1) command should build a parallel commit DAG by walking
the history in topological (--topo-order) ordering and then update the refs
all at once at the end.
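
As a rough illustration of the parallel-DAG idea, the sketch below rewrites commits in parents-first order while recording an old-to-new ID map, so that refs can be moved only after every commit has been rewritten. The commit type and rewriteAll function are hypothetical and not part of Git LFS. It assumes the input is already ordered parents-first, for example as produced by git rev-list --topo-order --reverse.

```go
package main

import "fmt"

// commit is a simplified stand-in for a Git commit: an ID plus parent IDs.
type commit struct {
	id      string
	parents []string
}

// rewriteAll walks commits that are ordered parents-first and returns a map
// from each old ID to its rewritten ID. The rewrite callback receives the
// already-rewritten parent IDs so the new commit can point at new parents.
func rewriteAll(ordered []commit, rewrite func(c commit, newParents []string) string) (map[string]string, error) {
	oldToNew := make(map[string]string, len(ordered))
	for _, c := range ordered {
		newParents := make([]string, 0, len(c.parents))
		for _, p := range c.parents {
			np, ok := oldToNew[p]
			if !ok {
				return nil, fmt.Errorf("parent %s of %s not yet rewritten", p, c.id)
			}
			newParents = append(newParents, np)
		}
		oldToNew[c.id] = rewrite(c, newParents)
	}
	return oldToNew, nil
}

func main() {
	// A small history: a is the root, b and c branch from a, d merges them.
	history := []commit{
		{id: "a"},
		{id: "b", parents: []string{"a"}},
		{id: "c", parents: []string{"a"}},
		{id: "d", parents: []string{"b", "c"}},
	}
	oldToNew, err := rewriteAll(history, func(c commit, newParents []string) string {
		newID := c.id + "'" // placeholder for the ID of the rewritten commit
		fmt.Printf("%s -> %s parents=%v\n", c.id, newID, newParents)
		return newID
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Only at this point would the refs be moved from the old tip to the new.
	fmt.Println("new tip for old tip d:", oldToNew["d"])
}
```

Because nothing touches the refs until the map is complete, a failure partway through leaves the original history untouched.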

Step-by-step, this would be (for the forward conversion):

  1. For each ref:
    1. For each file in that ref matching the -I and -X flags:
      1. Incrementally calculate the SHA-256 checksum of that file's contents
      2. Generate (and optionally cache) a pointer corresponding to that file
      3. Write the pointer to disk
      4. Call git-update-index(1) to replace that file in the index of the
        current ref.
    2. Call git-write-tree(1) and git-commit-tree(1), attaching the
      appropriate parents to the current "new" commit.
  2. Call git-update-ref(1) to move the refs from the old history to the new
    one.
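
The checksum and pointer-generation steps above could look roughly like the following sketch. buildPointer and pointerForBytes are hypothetical names; the pointer text follows the Git LFS v1 pointer layout (version, oid, size). Streaming through io.Copy means the file is never held in memory as a whole.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// buildPointer streams r through SHA-256 in chunks and returns pointer text
// in the Git LFS v1 layout, using the byte count seen while hashing as size.
func buildPointer(r io.Reader) (string, error) {
	h := sha256.New()
	size, err := io.Copy(h, r)
	if err != nil {
		return "", err
	}
	oid := hex.EncodeToString(h.Sum(nil))
	return fmt.Sprintf(
		"version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %d\n",
		oid, size,
	), nil
}

// pointerForBytes is a convenience wrapper for content already in memory.
func pointerForBytes(data []byte) (string, error) {
	return buildPointer(bytes.NewReader(data))
}

func main() {
	ptr, err := pointerForBytes([]byte("pretend this is a large .dat file\n"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(ptr)
}
```

In a real migration the reader would come from the Git object database rather than from memory, and the large content would also need to be copied into the local LFS object store.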

And for the backward conversion:

  1. For each ref:
    1. For each file in that ref that is parse-able as a pointer and matches
      the -I and -X flag(s):
      1. Fetch an io.Reader handle on the contents of the file in the cache
      2. Write that file to disk
      3. Call git-update-index(1) to replace that file in the index of the
        current ref.
    2. Call git-write-tree(1) and git-commit-tree(1), attaching the
      appropriate parents to the current "new" commit.
  2. Call git-update-ref(1) to move the refs from the old history to the new
    one.
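
For the backward direction, deciding whether a file is "parse-able as a pointer" might be sketched as below. The pointer type and parsePointer function are hypothetical, and the checks are deliberately loose: the 1024-byte limit and the key/value line format follow my reading of the Git LFS pointer spec, but the real parser is stricter (for example about key ordering).

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// pointer holds the two fields needed to locate the real content.
type pointer struct {
	OID  string // hex SHA-256 of the real file contents
	Size int64  // size of the real file in bytes
}

const (
	specV1         = "https://git-lfs.github.com/spec/v1"
	maxPointerSize = 1024 // assumption: pointers are smaller than this
)

// parsePointer returns the pointer described by text, or an error if text
// does not look like a Git LFS v1 pointer.
func parsePointer(text string) (pointer, error) {
	var p pointer
	if len(text) >= maxPointerSize {
		return p, errors.New("too large to be a pointer")
	}
	fields := make(map[string]string)
	for _, line := range strings.Split(strings.TrimSuffix(text, "\n"), "\n") {
		parts := strings.SplitN(line, " ", 2)
		if len(parts) != 2 {
			return p, fmt.Errorf("malformed pointer line %q", line)
		}
		fields[parts[0]] = parts[1]
	}
	if fields["version"] != specV1 {
		return p, errors.New("not a Git LFS v1 pointer")
	}
	if !strings.HasPrefix(fields["oid"], "sha256:") {
		return p, errors.New("missing sha256 oid")
	}
	oid := strings.TrimPrefix(fields["oid"], "sha256:")
	if len(oid) != 64 {
		return p, errors.New("oid has unexpected length")
	}
	size, err := strconv.ParseInt(fields["size"], 10, 64)
	if err != nil || size < 0 {
		return p, errors.New("invalid size")
	}
	return pointer{OID: oid, Size: size}, nil
}

func main() {
	text := "version " + specV1 + "\n" +
		"oid sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\n" +
		"size 0\n"
	p, err := parsePointer(text)
	fmt.Println(p.OID, p.Size, err)

	_, err = parsePointer("just a regular text file\n")
	fmt.Println("regular file:", err)
}
```

Files that fail to parse would simply be left untouched by the backward migration.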

Optimizations

I can think of a few ways to optimize the operation of the git-lfs-migrate(1)
command:

  1. Instead of a full tree-scan:
    1. Calculate the diff between each pair of consecutive revisions
    2. Reuse all pointers from the last revision that were unchanged
    3. Update (or create) all pointers for new or changed files that match the
      -I and -X flag(s).
  2. Instead of reading the entire contents of a file each time, cache it based on
    the OID it has in Git's ODB.
  3. Parallelize the tree-walk and re-assemble the DAG in chunks, using the same
    diffing and caching strategies as above.
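
The OID-keyed caching idea can be sketched as a small memoization layer: the Git blob OID is the key, and the expensive read-and-hash step runs only on a miss. pointerCache and lfsOID are hypothetical names, and the in-memory map is an assumption; a real implementation might persist the cache or bound its size.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// pointerCache memoizes the LFS OID computed for each Git blob OID, so a blob
// that appears unchanged in many commits is read and hashed only once.
type pointerCache struct {
	byBlob map[string]string
}

func newPointerCache() *pointerCache {
	return &pointerCache{byBlob: make(map[string]string)}
}

// lfsOID returns the SHA-256 of the blob's contents. The load callback is
// invoked only when blobOID has not been seen before.
func (c *pointerCache) lfsOID(blobOID string, load func() ([]byte, error)) (string, error) {
	if oid, ok := c.byBlob[blobOID]; ok {
		return oid, nil
	}
	data, err := load()
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	oid := hex.EncodeToString(sum[:])
	c.byBlob[blobOID] = oid
	return oid, nil
}

func main() {
	cache := newPointerCache()
	reads := 0
	load := func() ([]byte, error) {
		reads++ // stands in for reading the blob from Git's ODB
		return []byte("large file contents"), nil
	}
	// The same blob OID is encountered in three consecutive commits.
	for i := 0; i < 3; i++ {
		if _, err := cache.lfsOID("hypothetical-blob-oid", load); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("blob reads:", reads)
}
```

This works because a Git blob OID identifies its contents, so the pointer derived from it never changes between commits.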

/cc @git-lfs/core for thoughts and/or concerns
/cc @peff for thoughts on the technical overview and additional insight into possible optimizations
