21

Let's say you have the repository:

myCode/megaProject/moduleA
myCode/megaProject/moduleB

Over time (months), you reorganise the project, refactoring the code to make the modules independent. Files in the megaProject directory get moved into their own directories. The emphasis is on move: the history of these files is preserved.

myCode/megaProject
myCode/moduleA
myCode/moduleB

Now you wish to move these modules to their own Git repos, leaving the original with just megaProject on its own.

myCode/megaProject
newRepoA/moduleA
newRepoB/moduleB

The filter-branch command is documented to do this, but it doesn't follow history when files were moved from outside the target directory. So the history begins at the point the files were moved into their new directory, and does not include the history the files had when they lived in the old megaProject directory.
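To make the problem concrete, here is a minimal sketch that reproduces it in a throwaway repository. The repository, file name and commit messages are invented for the demo; it only assumes git is installed. A path-limited git log stops at the move, while git log --follow walks back through the rename.

```shell
#!/bin/bash
# Throwaway demo repo; every name below is hypothetical.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

# Two commits while the file still lives under megaProject
mkdir -p megaProject/moduleA
echo "v1" > megaProject/moduleA/a.txt
git add . && git commit -q -m "add a.txt under megaProject"
echo "v2" >> megaProject/moduleA/a.txt
git commit -q -am "edit a.txt"

# Third commit: move the file into its own top-level directory
mkdir moduleA
git mv megaProject/moduleA/a.txt moduleA/a.txt
git commit -q -m "move a.txt to moduleA"

# Path-limited log only sees commits that touched the new path
plain_count=$(git log --oneline -- moduleA/a.txt | wc -l | tr -d ' ')
# --follow continues through the rename to the old path
follow_count=$(git log --oneline --follow -- moduleA/a.txt | wc -l | tr -d ' ')

echo "without --follow: $plain_count commit(s)"
echo "with --follow:    $follow_count commit(s)"
```

A filter that only looks at the new path sees the same thing as the first log, which is why the pre-move commits disappear.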

How can I split a Git history based on a target directory while also following history outside of this path, leaving only the commit history related to these files and nothing else?

The numerous other answers on SO focus on generally splitting apart the repo - but make no mention of splitting apart and following the move history.

6 Answers

11

This is a version based on @rksawyer's scripts, but it uses git-filter-repo instead. I found it much easier to use and much faster than git-filter-branch (and it is now recommended by Git as a replacement).

# This script should run in the same folder as the project folder is.
# This script uses git-filter-repo (https://github.com/newren/git-filter-repo).
# The list of files and folders that you want to keep should be named <your_repo_folder_name>_KEEP.txt. It should end with a newline after the last line, otherwise the last file/folder will be skipped.
# The result will be the folder called <your_repo_folder_name>_REWRITE_CLONE. Your original repo won't be changed.
# Tags are not preserved, see line below to preserve tags.
# Running subsequent times will backup the last run in <your_repo_folder_name>_REWRITE_CLONE_BKP.

# Define here the name of the folder containing the repo: 
GIT_REPO="git-test-orig"

clone="$GIT_REPO"_REWRITE_CLONE
temp=/tmp/git_rewrite_temp
rm -Rf "$clone"_BKP
mv "$clone" "$clone"_BKP
rm -Rf "$temp"
mkdir "$temp"
git clone "$GIT_REPO" "$clone"
cd "$clone"
git remote remove origin
# Optional, macOS only: open the clone and the temp folder in Finder
open .
open "$temp"

# Comment line below to preserve tags
git tag | xargs git tag -d

echo 'Start logging file history...'
# printf is used because bash's echo does not expand \n by default
printf '# git log results:\n\n' > "$temp"/log.txt

while IFS= read -r p
do
    shopt -s dotglob
    find "$p" -type f > "$temp"/temp
    while IFS= read -r f
    do
        echo "## " "$f" >> "$temp"/log.txt
        # print every file and follow to get any previous renames
        # Then remove blank lines.  Then remove every other line to end up with the list of filenames
        git log --pretty=format:'%H' --name-only --follow -- "$f" | awk 'NF > 0' | awk 'NR%2==0' | tee -a "$temp"/log.txt

        printf '\n\n' >> "$temp"/log.txt
    done < "$temp"/temp
done < ../"$GIT_REPO"_KEEP.txt > "$temp"/PRESERVE

mv "$temp"/PRESERVE "$temp"/PRESERVE_full
awk '!a[$0]++' "$temp"/PRESERVE_full > "$temp"/PRESERVE

sort -o "$temp"/PRESERVE "$temp"/PRESERVE

echo 'Starting filter-repo --------------------------'
git filter-repo --paths-from-file "$temp"/PRESERVE --force --replace-refs delete-no-add
echo 'Finished filter-repo --------------------------'

The script also writes the git log output to /tmp/git_rewrite_temp/log.txt. If you don't need that log, you can remove the lines that write to log.txt to make it run faster.
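The least obvious part of the script is the awk pipeline that turns git log output into a list of paths. Here is a small sketch of just that step, run on canned input instead of a real repository. The hashes and paths are invented, and the final awk is the same de-duplication idiom the script applies to the PRESERVE file.

```shell
#!/bin/bash
# Canned stand-in for:
#   git log --pretty=format:'%H' --name-only --follow -- FILE
# Hashes and paths are made up for illustration.
sample_log() {
  printf '%s\n' \
    'c3c3c3c' 'moduleA/a.txt' '' \
    'b2b2b2b' 'megaProject/moduleA/a.txt' '' \
    'a1a1a1a' 'megaProject/moduleA/a.txt'
}

# 1. drop blank separator lines
# 2. keep every second line (the path printed under each hash)
# 3. drop repeated paths, keeping the first occurrence
paths=$(sample_log | awk 'NF > 0' | awk 'NR%2==0' | awk '!seen[$0]++')
echo "$paths"
```

The result is every path the file has ever had, which is the form git filter-repo --paths-from-file expects: one path per line.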


6 Comments

Awesome example of the use of an awesome tool! After a day of troubles with filter-branch, running for 40 minutes only not to work, this solved it correctly in about 5 seconds.
I had some messy old, empty commits, so I ended up adding --prune-empty always to the filter-repo command.
The auto setting will prune all commits that end up as empty when rewriting the repo. In my case, I guess I have actual empty commits. They seem to originate from the repo before it was git (svn), and probably wound up empty for some reason, either through svn being svn, or in the migration to git. Anyways, no reason to keep the commits, and they should probably just be removed from the original repo itself.
I'm kind of new to git-filter-repo, but reading through the documentation, shouldn't git filter-repo --analyze be able to give you information on renames?
I found your shell script version a little too different from what I'd have implemented to feel comfortable with it, so I wrote one in Python which behaves more similarly to bare git-filter-repo, has --help, and has a bunch of safety guards. I'm not sure what would be the most appropriate way to make it its own answer in this particular case. (It's a Gist, but it's also too long to code-block here IMO.)
4

Running git filter-branch --subdirectory-filter in your cloned repository will remove all commits that don't affect content in that subdirectory, which includes those affecting the files before they were moved.

Instead, you need to use the --tree-filter flag (or the faster --index-filter) with a script that deletes all the files you're not interested in, plus the --prune-empty flag to drop any commits that end up empty because they only affected other content.

There's a blog post from Kevin Deldycke with a good example of this:

git filter-branch --prune-empty --tree-filter 'find ./ -maxdepth 1 -not -path "./e107*" -and -not -path "./wordpress-e107*" -and -not -path "./.git" -and -not -path "./" -print -exec rm -rf "{}" \;' -- --all

This command effectively checks out each commit in turn, deletes all uninteresting files from the working directory and, if anything has changed from the last commit then it checks it in (rewriting the history as it goes). You would need to tweak that command to delete all files except those in, say, /moduleA, /megaProject/moduleA and the specific files you want to keep from /megaProject.
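Before handing a find expression like this to filter-branch, it can help to dry-run it on a scratch directory with -print only and no rm. The sketch below is a variant and not the blog's exact command: the directory names are hypothetical, and it uses find . with -mindepth 1 instead of find ./ so the paths look the same on GNU and BSD find.

```shell
#!/bin/bash
# Scratch tree standing in for one checked-out revision (names are made up).
set -e
tree=$(mktemp -d)
cd "$tree"
mkdir moduleA moduleB megaProject .git
touch moduleA/a.txt moduleB/b.txt megaProject/main.txt README

# List the top-level entries a "keep only moduleA" filter would delete.
# Nothing is removed here: there is no -exec rm, only the default -print.
doomed=$(find . -mindepth 1 -maxdepth 1 \
           -not -path './moduleA*' \
           -not -path './.git' | LC_ALL=C sort)
echo "$doomed"
```

Once the list looks right, the same tests can be used in the --tree-filter with -exec rm -rf added. Note that './moduleA*' is a prefix match, so it would also keep a sibling such as moduleAB.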

1 Comment

It didn't work for me, for some reason it deletes .git/refs/heads, destroying my repo. Interestingly enough not all files inside .git are deleted. Do you know why this may be happening? Also, I fail to see how this solution would preserve moves/renames.
2

I'm aware of no simple way to do this, but it can be done.

The problem with filter-branch is that it works by

applying custom filters on each revision

If you can create a filter which won't delete your files they will be tracked between directories. Of course this is likely to be non-trivial for any repository which isn't trivial.

To start, let's assume it is a trivial repository: you have never renamed a file, and you have never had files with the same name in two modules. All you need to do is get a list of the files in your module:

find megaProject/moduleA -type f -printf "%f\n" > preserve

and then run your filter using those filenames and your directory:

preserve.sh

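#!/bin/bash
# Build a find command that matches every file whose name is NOT in the preserve list.
# An array keeps filenames with spaces intact.
cmd=(find . -type f ! -name d1)
while IFS= read -r f; do
  cmd+=(! -name "$f")
done < /path/to/myCode/preserve
# Delete everything that matched
"${cmd[@]}" -exec rm -- {} +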

git filter-branch --prune-empty --tree-filter '/path/to/myCode/preserve.sh' HEAD

Of course, it's renames that make this difficult. One of the nice things about git filter-branch is that it gives you the $GIT_COMMIT environment variable. You can then get fancy and use things like:

# --follow tracks a single file, so loop over the files in the module
for f in megaProject/moduleA/*
do
 git log --pretty=format:'%H' --name-only --follow -- "$f" | awk '{ if($0 != ""){ printf "%s:", $0; next; } print; }'
done > preserve

to build a filename history, with commits, that could be used in place of the simple preserve file in the trivial example, but the onus is going to be on you to keep track of what files should be present at each commit. This actually shouldn't be too hard to code out, but I haven't seen anybody who's done it yet.

1 Comment

That looks cool if polished, but it doesn't work if applied as-is.
1

Following on from the answer above: first, iterate through all of the files in the directory that is being kept, using git log --follow to get the old paths/names from prior moves/renames. Then use filter-branch to iterate through every revision, removing any files that are not on the list created in step 1.

#!/bin/bash
DIRNAME=dirD

# Catch all files including hidden files
shopt -s dotglob
for f in "$DIRNAME"/*
do
# print every file and follow to get any previous renames
# Then remove blank lines.  Then remove every other line to end up with the list of filenames
 git log --pretty=format:'%H' --name-only --follow -- "$f" | awk 'NF > 0' | awk 'NR%2==0'
done > /tmp/PRESERVE

sort -o /tmp/PRESERVE /tmp/PRESERVE
cat /tmp/PRESERVE

Then create a script (preserve.sh) that filter-branch will call for each revision.

#!/bin/bash
DIRNAME=dirD

# Delete everything that's not in the PRESERVE list
# List every file in this revision, one per line, without the leading ./
# ($DIRNAME has to be in double quotes, otherwise the shell does not expand it)
find . -type f -not -path './.git/*' -not -path "./$DIRNAME/*" | cut -c3- > /tmp/ALL2
sort -o /tmp/ALL2 /tmp/ALL2

#echo 'before:'
#cat /tmp/ALL2

# Lines that are only in ALL2 are files that are not on the PRESERVE list
comm -23 /tmp/ALL2 /tmp/PRESERVE > /tmp/DELETE_THESE
echo 'delete these:'
cat /tmp/DELETE_THESE
#exit 0

while IFS= read -r f; do
  rm -- "$f"
done < /tmp/DELETE_THESE

Now use filter-branch. If all files are removed in a revision, --prune-empty drops that commit and its message.

 git filter-branch --prune-empty --tree-filter '/FULL_PATH/preserve.sh' master
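The comm -23 step in preserve.sh is what decides which files get deleted in each revision. As a rough illustration with invented file lists (the paths below are hypothetical, not from any real repository):

```shell
#!/bin/bash
set -e
work=$(mktemp -d)

# comm needs both inputs sorted in the same collation order,
# so the locale is pinned to C for sort and comm alike.
printf '%s\n' README dirD/kept.txt old/kept.txt other/junk.txt \
  | LC_ALL=C sort > "$work/ALL"
printf '%s\n' dirD/kept.txt old/kept.txt \
  | LC_ALL=C sort > "$work/PRESERVE"

# -2 hides lines unique to PRESERVE, -3 hides lines common to both,
# leaving the files present in the revision but not on the keep list.
to_delete=$(LC_ALL=C comm -23 "$work/ALL" "$work/PRESERVE")
echo "$to_delete"
```

If the two lists are sorted under different locales, comm can silently report wrong results, which is worth checking when the filter deletes more than expected.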

3 Comments

This works well! I had only to change a few things to make it work with paths that contain spaces.
@Roberto Hi, by any chance, do you still have the version that fixes the spaces?
@Stals Hi. You have to add quotes when using the variables, like "$DIRNAME". I posted mine as a new answer.
1

Here's my version of the script @Roberto posted, written for Linux/WSL. If you don't supply a "myrepo_KEEP.txt", it will create one based on the current file structure. Pass in the repo to work on:

prune.sh MyRepo

# This script should run one level up from the git repo folder (i.e. the  containing folder)
# This script uses git-filter-repo (github.com/newren/git-filter-repo).
# The result will be the folder called <your_repo_folder_name>_REWRITE_CLONE. Your original repo won't be changed.
# Tags are not preserved, see line below to preserve tags.
# Running subsequent times will backup the last run in <your_repo_folder_name>_REWRITE_CLONE_BKP.
# Optionally, list the files and folders that you want to keep in the KEEP_FILE (<your_repo_folder_name>_KEEP.txt)
## It should contain a line end in the last line, otherwise the last file/folder will be skipped.
## If this file is missing it will be created by this script with all current folders listed. 

echo "Prune git repo"

# User needs to pass in the repo name
GIT_REPO=$1

if [ -z "$GIT_REPO" ]; then
    echo "Pass in the directory to prune"
else
    KEEP_FILE="${GIT_REPO}"_KEEP.txt

    # Build up a list of current directories in the repo, if one hasn't been supplied
    if [ ! -f "${KEEP_FILE}" ]; then
        echo "Keeping all current files in repo (generating keep file)"
        cd "$GIT_REPO"
        find . -type d -not -path '*/\.*' > "../${KEEP_FILE}"
        cd ..
    fi

    echo "Pruning $GIT_REPO"

    clone="${GIT_REPO}_REWRITE_CLONE"
    
    # Shift backup
    bkp="${clone}_BKP"
    temp=/tmp/git_rewrite_temp
    echo $clone
    rm -Rf "$bkp"
    mv "$clone" "$bkp"
    
    # Setup temp
    rm -Rf "$temp"
    mkdir "$temp"   
    
    # Clone
    echo "Cloning repo...from $GIT_REPO to $clone"
    if git clone "$GIT_REPO" "$clone"; then
        cd "$clone"
        git remote remove origin

        # Comment line below to preserve tags
        git tag | xargs git tag -d

        echo 'Start logging file history...'
        # printf is used because bash's echo does not expand \n by default
        printf '# git log results:\n\n' > "$temp"/log.txt

        # Follow the renames
        while IFS= read -r p
        do
            shopt -s dotglob
            find "$p" -type f > "$temp"/temp
            while IFS= read -r f
            do
                echo "## " "$f" >> "$temp"/log.txt
                # print every file and follow to get any previous renames
                # Then remove blank lines.  Then remove every other line to end up with the list of filenames
                git log --pretty=format:'%H' --name-only --follow -- "$f" | awk 'NF > 0' | awk 'NR%2==0' | tee -a "$temp"/log.txt

                printf '\n\n' >> "$temp"/log.txt
            done < "$temp"/temp
        done < ../"${KEEP_FILE}" > "$temp"/PRESERVE

        mv "$temp"/PRESERVE "$temp"/PRESERVE_full
        awk '!a[$0]++' "$temp"/PRESERVE_full > "$temp"/PRESERVE

        sort -o "$temp"/PRESERVE "$temp"/PRESERVE

        echo 'Starting filter-repo --------------------------'
        git filter-repo --paths-from-file "$temp"/PRESERVE --force --replace-refs delete-no-add
        echo 'Finished filter-repo --------------------------'
        cd ..
    fi
fi

Credit to @rksawyer & @Roberto.

1 Comment

A few enhancements:
1) To generate the KEEP file, I would use this:
find . -maxdepth 1 -type d -not -path '/\.' -not -path '.' > "../${KEEP_FILE}"
2) Instead of:
done < ../"${KEEP_FILE}" > "$temp"/PRESERVE
mv "$temp"/PRESERVE "$temp"/PRESERVE_full
awk '!a[$0]++' "$temp"/PRESERVE_full > "$temp"/PRESERVE
sort -o "$temp"/PRESERVE "$temp"/PRESERVE
you can simply do:
done < ../"${KEEP_FILE}" | sort | uniq > "$temp"/PRESERVE
-2

We painted ourselves into a much worse corner, with dozens of projects across dozens of branches, with each project dependent on 1-4 others, and 56k commits total. filter-branch was taking up to 24 hours just to split a single directory off.

I ended up writing a tool in .NET using libgit2sharp and raw file system access to split an arbitrary number of directories per project, and only preserve relevant commits/branches/tags for each project in the new repos. Instead of modifying the source repo, it writes out N other repos with only the configured paths/refs.

You're welcome to see if this suits your needs, modify it, etc. https://github.com/CurseStaff/GitSplit

2 Comments

The linked repo doesn't exist or isn't public.
Sounds great, would be nice to be able to see it? If you want this answer to be upvoted you'll want to post some useful details not just posting a hyperlink too, btw.
