proposal: `migrate` sub-command #2146

## Objective

Make it easier for people to start using Git LFS when they already have large
objects committed into Git repositories.

## Background

A common problem is that people follow this path:

1. Start with a Git repository that has large objects in the history
2. `git lfs track "*.dat"`
3. Attempt to push to a remote, have the push rejected by a `pre-receive` hook,
   and stop using Git LFS.

The underlying problem is an expectation that `git lfs track` will convert
previously existing large Git blobs to LFS objects, which it does not do.
`git lfs track` only tracks objects within LFS _after_ the corresponding changes
to the `.gitattributes` file have been made.

What we have recommended thus far is to either run `git-filter-branch(1)` or
download an external tool with the same purpose (one example of this is BFG:
https://github.com/rtyley/bfg-repo-cleaner).

## Solution

I would like to make it possible for Git LFS to do this conversion without
needing to run a complicated `git-filter-branch(1)` command or use an external
tool. By implementing this in Git LFS ourselves, we have control of the behavior
of the migration tool, which would allow us to implement features like:

1. Forward and backward (to and from LFS objects) migrations
2. Specific file-path matching
3. Fast, LFS-specific object caching

If we could bundle a migration tool to convert large blobs in Git history to LFS
objects within Git LFS itself, we would also be able to:

1. Make it easier for users to adopt Git LFS
2. Cut down on a class of bug reports where `git lfs track` is run against large
objects that already exist

### Flags

I'd like the `git-lfs-migrate(1)` command to have the following options:

1. `-I`, `-X` (or `--include` and `--exclude`): our standard pattern matching
flags to dictate which objects should and shouldn't be tracked by Git LFS

A non-goal here is `--greater-than=` or `--less-than=` size flags, since this
type of pattern matching is not supported in the attributes specification at
https://git-scm.com/docs/gitattributes.
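
As a rough illustration of how `-I` and `-X` might combine, here is a minimal Python sketch. It makes several assumptions that are not part of this proposal: the helper name `should_migrate` is hypothetical, an empty include list is taken to mean "everything", an exclude match always wins, and Python's `fnmatch` globbing stands in for real gitattributes-style matching, which it only approximates.

```python
from fnmatch import fnmatchcase


def should_migrate(path, include=(), exclude=()):
    """Decide whether `path` should be converted (hypothetical helper).

    Assumptions: exclude patterns take precedence, and an empty include
    list matches every path. fnmatch globbing is only an approximation of
    gitattributes pattern semantics.
    """
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    if not include:
        return True
    return any(fnmatchcase(path, pattern) for pattern in include)


if __name__ == "__main__":
    for candidate in ["data/a.dat", "vendor/a.dat", "README.md"]:
        print(candidate, should_migrate(candidate, ["*.dat"], ["vendor/*"]))
```

A size-based filter would not fit this shape, because the decision here depends only on the path, which is all a `.gitattributes` entry can express.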

### Sub-commands

At first I thought it would be unnecessary to supply a "reverse" (from LFS
object to Git large object) migration, because the migration command would be
able to build a parallel DAG of commit objects and then do a `git-update-ref(1)`
at the end, producing a `git reflog` like:

```bash
~/example-repo (master) $ git reflog
dc0b3c60 HEAD@{1}: checkout: moving from master-lfs-migrate to master
115a1a1 HEAD@{2}: commit: tools/time_tools: test tools.IsExpiredAtOrIn
...
```

Meaning that the user could do a `git-reset(1)` (with the `--hard` flag) to move
their repository back to its original state.

This approach falls short when the original repository (with the dangling refs)
has been removed. Similarly, it does not survive a `git gc` or
`git prune --expire=now`. In other words, if the old, large Git objects are
gone, there is no way to restore your repository.

That being said, I'd like to have both:

1. `git lfs migrate up`, and
2. `git lfs migrate down`

to go to and from storing objects in LFS, respectively.

## Technical Overview

The `git-lfs-migrate(1)` command should build a parallel commit DAG by walking
the history in topological (`--topo-order`) ordering and then update the refs
all at once at the end.
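
To make the "parallel DAG" idea concrete, here is a minimal Python sketch (3.9+ for `graphlib`). It is an illustration under stated assumptions, not the proposed implementation: `rewrite_history` and the `rewrite_commit` callback are hypothetical names, and commits are plain strings. The sketch visits parents before children, which is the reverse of the order `git rev-list --topo-order` prints by default.

```python
from graphlib import TopologicalSorter


def rewrite_history(parents, rewrite_commit):
    """Build a parallel history and return a map of old ID -> new ID.

    parents: dict mapping each old commit ID to a list of its old parent IDs.
    rewrite_commit: hypothetical callback (old_id, new_parent_ids) -> new_id.
    """
    old_to_new = {}
    # static_order() yields every node after all of its predecessors, so each
    # commit's rewritten parents already exist when the commit is visited.
    for old_id in TopologicalSorter(parents).static_order():
        new_parents = [old_to_new[p] for p in parents.get(old_id, [])]
        old_to_new[old_id] = rewrite_commit(old_id, new_parents)
    return old_to_new


if __name__ == "__main__":
    history = {"c2": ["c1"], "c1": ["c0"], "c0": []}
    mapping = rewrite_history(history, lambda old, new_parents: old + "-lfs")
    print(mapping)
```

Only after the whole map exists would the refs be moved, so an interrupted run leaves the original history untouched.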

Step-by-step, this would be (for the forward conversion):

1. For each ref:
   1. For each file in that ref matching the `-I` and `-X` flags:
      1. Iteratively calculate the SHA-256 checksum of that file's contents
      2. Generate (and optionally cache) a pointer corresponding to that file
      3. Write the pointer to disk
      4. Call `git-update-index(1)` to replace that file in the index of the
         current ref.
   2. Call `git-write-tree(1)` to write the new tree, then `git-commit-tree(1)`
      to create the "new" commit with the appropriate parents attached.
2. Call `git-update-ref(1)` to move the refs from the old history to the new
   one.
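
The hashing and pointer-generation steps above can be sketched in Python as follows. This is a minimal illustration, assuming the v1 pointer layout (a `version` line followed by `oid sha256:` and `size` lines); the function name `build_pointer` and the chunk size are hypothetical choices, not part of the proposal.

```python
import hashlib
import io

POINTER_VERSION = "https://git-lfs.github.com/spec/v1"


def build_pointer(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks and return LFS pointer text for it."""
    digest = hashlib.sha256()
    size = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Feeding the hash chunk by chunk avoids loading large files whole.
        digest.update(chunk)
        size += len(chunk)
    return (
        f"version {POINTER_VERSION}\n"
        f"oid sha256:{digest.hexdigest()}\n"
        f"size {size}\n"
    )


if __name__ == "__main__":
    print(build_pointer(io.BytesIO(b"example large file contents")), end="")
```

Reading in fixed-size chunks keeps memory use flat regardless of how large the original blob is.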

And for the backward conversion:

1. For each ref:
   1. For each file in that ref that is parse-able as a pointer and matches the
      `-I` and `-X` flag(s):
      1. Fetch an `io.Reader` handle on the contents of the file in the cache
      2. Write that file to disk
      3. Call `git-update-index(1)` to replace that file in the index of the
         current ref.
   2. Call `git-write-tree(1)` to write the new tree, then `git-commit-tree(1)`
      to create the "new" commit with the appropriate parents attached.
2. Call `git-update-ref(1)` to move the refs from the old history to the new
   one.
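
For the backward direction, the "parse-able as a pointer" check could look roughly like the Python sketch below. It is deliberately stricter and simpler than a real parser: it assumes the v1 pointer layout, requires the `version` line first, and ignores unknown keys. The name `parse_pointer` is hypothetical.

```python
POINTER_VERSION = "https://git-lfs.github.com/spec/v1"
HEX_DIGITS = set("0123456789abcdef")


def parse_pointer(text):
    """Return (oid, size) if `text` looks like an LFS pointer, else None."""
    lines = text.splitlines()
    if not lines or lines[0] != f"version {POINTER_VERSION}":
        return None
    # Remaining lines are "key value" pairs; lines without a space are skipped.
    fields = dict(line.split(" ", 1) for line in lines[1:] if " " in line)
    oid = fields.get("oid", "")
    size = fields.get("size", "")
    if not oid.startswith("sha256:"):
        return None
    hex_oid = oid[len("sha256:"):]
    if len(hex_oid) != 64 or not set(hex_oid) <= HEX_DIGITS:
        return None
    if not (size.isascii() and size.isdigit()):
        return None
    return hex_oid, int(size)


if __name__ == "__main__":
    sample = f"version {POINTER_VERSION}\noid sha256:{'ab' * 32}\nsize 12345\n"
    print(parse_pointer(sample))
    print(parse_pointer("just a regular file\n"))
```

Files that fail this check would be left untouched, so ordinary blobs that happen to match `-I` are never mistaken for pointers.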

### Optimizations

I can think of a few ways to optimize the operation of the `git-lfs-migrate(1)`
command:

1. Instead of a full tree-scan:
   1. Calculate the diff between each pair of consecutive revisions
   2. Reuse the pointers from the previous revision for files that are unchanged
   3. Update (or create) all pointers for new or changed files that match the
      `-I` and `-X` flag(s)
2. Instead of reading the entire contents of a file each time, cache it based on
   the OID it has in Git's ODB.
3. Parallelize the tree-walk and re-assemble the DAG in chunks, using the same
   diffing and caching strategies as above.
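
The second optimization (caching by Git OID) can be sketched in Python as below. This is an assumption-laden illustration: `PointerCache` and the `read_blob` callback (blob OID to bytes) are hypothetical, and the cache is an in-memory dict rather than anything persistent. The point is that a blob appearing in many commits is hashed only once.

```python
import hashlib

POINTER_VERSION = "https://git-lfs.github.com/spec/v1"


class PointerCache:
    """Memoize LFS pointer text by the blob's OID in Git's object database."""

    def __init__(self, read_blob):
        self._read_blob = read_blob  # hypothetical: blob OID -> bytes
        self._pointers = {}
        self.misses = 0  # number of times a blob actually had to be read

    def pointer_for(self, blob_oid):
        if blob_oid not in self._pointers:
            self.misses += 1
            data = self._read_blob(blob_oid)
            digest = hashlib.sha256(data).hexdigest()
            self._pointers[blob_oid] = (
                f"version {POINTER_VERSION}\n"
                f"oid sha256:{digest}\n"
                f"size {len(data)}\n"
            )
        return self._pointers[blob_oid]


if __name__ == "__main__":
    blobs = {"blob-1": b"hello"}
    cache = PointerCache(blobs.__getitem__)
    cache.pointer_for("blob-1")
    cache.pointer_for("blob-1")
    print("blob reads:", cache.misses)
```

Because identical content always has the same Git OID, the cache key is safe to share across commits and refs.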

---

/cc @git-lfs/core for thoughts and/or concerns
/cc @peff for thoughts on the technical overview and additional insight into possible optimizations
