Skip to content

Call migrate() BlobFn on every blob#4671

Merged
bk2204 merged 3 commits intogit-lfs:mainfrom
ycongal-smile:4628_dup_files_v2
Oct 12, 2021
Merged

Call migrate() BlobFn on every blob#4671
bk2204 merged 3 commits intogit-lfs:mainfrom
ycongal-smile:4628_dup_files_v2

Conversation

@ycongal-smile
Copy link
Contributor

Caching is now done inside the BlobFn function instead of in the caller. To do this, the OID of the original blob is given as an argument to BlobFn.

Adapt migrate() calls to this :

  • For "migrate info", a set of "seen" OID is cached to avoid counting them twice.
  • For "migrate import", the cleaned blob is cached and reused.

Fixes #4628

Also add a test case for #4628 and adapt unittest to follow the BlobFn change

Duplicated files = same name and same content.

This reproduce issue #4628
Caching is now done inside the BlobFn function instead of in the caller.
To do this, the OID of the original blob is given as an argument to
BlobFn.

Adapt migrate() calls to this :
 * For "migrate info", a set of "seen" OID is cached to avoid counting
   them twice.
 * For "migrate import", the cleaned blob is cached and reused.

Fixes #4628
TestRewriterDoesntVisitUnchangedSubtrees:
* Changed to TestRewriterDoesVisitUnchangedSubtrees (BlobFn will now be
  called on every blobs whether they changed or not)
* Assert changed to match that.

TestRewriterIgnoresPathsThatDontMatchFilter:
* Update "seen" count: "subdir/b.txt" should now be seen twice.

TestHistoryRewriterCallbacksSubtrees:
* Update asserted call order. BlobFn("a.txt") is now called twice.
@ycongal-smile
Copy link
Contributor Author

(Fixed format errors)

Copy link
Member

@bk2204 bk2204 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey,

Thanks again for the patch. This looks good to me.

I've re-run the tests because we've had some flaky tests recently, which I've opened #4674 for, and assuming this passes, I'll go ahead and merge it, probably on Tuesday.

@ycongal-smile
Copy link
Contributor Author

Thanks!

@ycongal-smile ycongal-smile deleted the 4628_dup_files_v2 branch October 12, 2021 20:58
bk2204 added a commit to bk2204/git-lfs that referenced this pull request Oct 25, 2021
bk2204 added a commit to bk2204/git-lfs that referenced this pull request Jan 18, 2022
When we cache files, do so on the full path instead of just the
directory entry.  This means that when we have an identical file with
the same name in two different direectories, we distinguish between the
two paths and ensure both are added to .gitattributes.

This is an alternate solution to git-lfs#4671 which should perform better.  For
compmarison, with a clone of Git's main repository with the following
command, we get:

git lfs migrate import --everything --include="*.h":

* v3.0.1      (broken):  608s user,   53s system,    5:34 total
* v3.0.2      (fixed): 13435s user, 1255s system, 1:43:17 total
* this commit (fixed):   716s user,   67s system,    6:59 total

This is a much better performance characteristic for equivalent results.

Preserve the integration from the earlier attempt at fixing this plus
add an additional one.  Avoid using assert_pointer in the new test
because that helper doesn't always work correctly when there are two
files with the same file name.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

git-lfs migrate import does not add attributes for copied files (same name and content)

2 participants