
[Copy] Support cross-repo file copies #4203

Merged
Wauplin merged 37 commits into main from feat/cross-repo-lfs-copy on May 28, 2026

Conversation

Wauplin commented May 7, 2026

Collaborator

Summary

Closes #3874 (partially — bucket-to-repo is not in scope).

Adds support for copying files between repositories (model→model, model→dataset, etc.) via CommitOperationCopy and copy_files. This builds on the server-side /lfs-files/duplicate endpoint (https://github.com/huggingface-internal/moon-landing/pull/18121).

What changed

CommitOperationCopy now accepts src_repo_id and src_repo_type for cross-repo copies. Both LFS and regular files are supported — LFS files are duplicated server-side, regular files are downloaded and re-uploaded as part of the commit.

copy_files (hf buckets cp) now supports repo-to-repo in addition to bucket destinations. Internally refactored into _copy_to_bucket and _copy_to_repo paths.

_duplicate_lfs_files on HfApi handles LFS object duplication before commit (called automatically by create_commit). Raises FileDuplicationError on failure.

_resolve_copy_target_path extracted as a standalone function for path resolution (shared by bucket and repo copy paths).

_fetch_files_to_copy unified to handle both intra-repo and cross-repo copies — resolves file metadata from the source repo when src_repo_id is set.
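The both-or-neither rule for `src_repo_id` and `src_repo_type`, and the path normalization exercised by the unit tests in this PR, can be sketched as follows. This is a minimal stand-in, not the library class: `CopyOp` and `_normalize` are hypothetical names, and the normalization rule (dropping a leading `./` or `/`) is an assumption inferred from the test expectations. Only the two error messages are taken from the tests.

```python
from dataclasses import dataclass
from typing import Optional


def _normalize(path: str) -> str:
    # Assumed rule: drop a leading "./" or "/" so the path is repo-relative.
    if path.startswith("./"):
        path = path[2:]
    return path.lstrip("/")


@dataclass
class CopyOp:
    """Hypothetical stand-in for CommitOperationCopy (not the real class)."""

    src_path_in_repo: str
    path_in_repo: str
    src_repo_id: Optional[str] = None
    src_repo_type: Optional[str] = None

    def __post_init__(self) -> None:
        self.src_path_in_repo = _normalize(self.src_path_in_repo)
        self.path_in_repo = _normalize(self.path_in_repo)
        # A cross-repo source needs both fields; an intra-repo copy needs neither.
        if self.src_repo_id is not None and self.src_repo_type is None:
            raise ValueError("`src_repo_type` is required when `src_repo_id` is set")
        if self.src_repo_type is not None and self.src_repo_id is None:
            raise ValueError("`src_repo_id` is required when `src_repo_type` is set")

    @property
    def is_cross_repo(self) -> bool:
        return self.src_repo_id is not None
```

With this sketch, `CopyOp("src.bin", "dst.bin", src_repo_id="user/source", src_repo_type="model")` is accepted as a cross-repo copy, while passing only one of the two source fields raises `ValueError`.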

Copy matrix

| Source | Destination | Status |
| --- | --- | --- |
| Bucket | Bucket | ✅ (existing) |
| Repo | Bucket | ✅ (existing) |
| Repo | Repo | new |
| Bucket | Repo | ❌ not supported |
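The matrix can also be read as a lookup table. The sketch below is illustrative only: `SUPPORTED_COPIES` and `check_copy_supported` are hypothetical names, and raising `ValueError` for the unsupported combination is an assumption, since the PR does not say which exception `copy_files` raises in that case.

```python
# (source kind, destination kind) -> supported?  Mirrors the copy matrix.
SUPPORTED_COPIES = {
    ("bucket", "bucket"): True,   # existing
    ("repo", "bucket"): True,     # existing
    ("repo", "repo"): True,       # added by this PR
    ("bucket", "repo"): False,    # explicitly out of scope
}


def check_copy_supported(source_kind: str, destination_kind: str) -> None:
    """Raise if the (source, destination) pair is not a supported copy."""
    key = (source_kind, destination_kind)
    if key not in SUPPORTED_COPIES:
        raise ValueError(f"Unknown copy combination: {key}")
    if not SUPPORTED_COPIES[key]:
        raise ValueError(f"Copy from {source_kind} to {destination_kind} is not supported")
```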

🤖 Generated with Claude Code


Note

Medium Risk
Changes commit and LFS duplication flows for cross-repo copies; impact is mitigated by tests and by still blocking bucket-to-repo copies.

Overview
This PR adds repo-to-repo copying via [copy_files] and [CommitOperationCopy], alongside existing bucket copy paths.

[CommitOperationCopy] now accepts optional src_repo_id and src_repo_type for cross-repo sources. LFS blobs are duplicated on the Hub with a new [_duplicate_lfs_files] step (batched /lfs-files/duplicate) before [create_commit]; failures raise FileDuplicationError. Non-LFS files are still fetched from the source repo and committed as regular file payloads. [_fetch_files_to_copy] is keyed by a _CopySource tuple so metadata and downloads resolve against the correct source repo.
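As a rough illustration of the batching step, the sketch below deduplicates LFS object IDs and splits them into fixed-size batches. It is not the code from this PR: `chunked` and `plan_duplicate_batches` are hypothetical helpers, and the batch size of 500 is an assumption borrowed from the `chunk_iterable(src_paths, 500)` call visible in the review excerpt, which may not be the size used for the `/lfs-files/duplicate` requests.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def plan_duplicate_batches(
    lfs_files: Iterable[Tuple[str, str]], batch_size: int = 500
) -> List[List[str]]:
    """Take (path, oid) pairs and return batches of unique OIDs, in first-seen order."""
    seen_oids = set()
    unique_oids = []
    for _path, oid in lfs_files:
        if oid not in seen_oids:  # the same blob may back several paths
            seen_oids.add(oid)
            unique_oids.append(oid)
    return list(chunked(unique_oids, batch_size))
```

Deduplicating first means a blob referenced by several copy operations is only requested once, whatever the real batch size turns out to be.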

[copy_files] is split into _copy_to_bucket and _copy_to_repo; shared destination path rules live in _resolve_copy_target_path (single file, folder nesting, trailing-/ rsync semantics). Bucket → repo remains explicitly unsupported.
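The three destination rules named above (single file, folder nesting, trailing-`/` rsync semantics) might look like the sketch below. The exact rules implemented by `_resolve_copy_target_path` are not spelled out in this PR description, so everything here is an assumption modeled on rsync behavior; `resolve_target_path` and its parameters are hypothetical names. The only case confirmed by the test script in this PR is copying a repo root into `from_model/`.

```python
import posixpath


def resolve_target_path(src_root: str, src_file: str, dst: str, *, src_is_file: bool) -> str:
    """Return the destination path for `src_file` (assumed rsync-like rules)."""
    dst_dir = dst.strip("/")
    if src_is_file:
        # Single file: a trailing "/" (or empty dst) means "copy into this folder".
        if dst == "" or dst.endswith("/"):
            return posixpath.join(dst_dir, posixpath.basename(src_file))
        return dst_dir  # otherwise dst is the full target file name

    root = src_root.strip("/")
    # Path of the file relative to the source folder.
    relative = src_file[len(root) + 1:] if root else src_file
    if src_root == "" or src_root.endswith("/"):
        # Trailing "/" on the source: copy the folder's contents.
        return posixpath.join(dst_dir, relative)
    # No trailing "/": nest the source folder itself under the destination.
    return posixpath.join(dst_dir, posixpath.basename(root), relative)
```

For example, under these assumed rules, copying folder `data/` to `backup/` maps `data/a/x.bin` to `backup/a/x.bin`, whereas copying `data` (no trailing slash) maps it to `backup/data/a/x.bin`.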

Docs cover repo-to-repo usage in the upload and buckets guides, and [create_commit] documents the new copy fields.

Reviewed by Cursor Bugbot for commit a59e63a. Bugbot is set up for automated code reviews on this repo. Configure here.

Add `src_repo_id` and `src_repo_type` to `CommitOperationCopy` to enable copying
LFS files across repositories. For cross-repo LFS copies, the client fetches the
xet hash from the source repo via `get_paths_info` and passes it to the server which
handles the xet hash duplication. Regular (non-LFS) cross-repo files are downloaded
and re-uploaded as before.
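The split described in this commit message, server-side duplication for LFS files and download plus re-upload for regular files, can be sketched as a simple partition. `SourceFile` and `split_by_strategy` are hypothetical names, and representing "is an LFS file" as a non-empty `xet_hash` field is an assumption made for illustration, not the shape of the metadata returned by `get_paths_info`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SourceFile:
    """Hypothetical metadata record for one file in the source repo."""

    path: str
    xet_hash: Optional[str] = None  # None means a regular (non-LFS) file


def split_by_strategy(files: List[SourceFile]) -> Tuple[List[SourceFile], List[SourceFile]]:
    """Return (duplicate_on_server, download_and_reupload)."""
    duplicate_on_server = [f for f in files if f.xet_hash is not None]
    download_and_reupload = [f for f in files if f.xet_hash is None]
    return duplicate_on_server, download_and_reupload
```

Applied to the test script in this PR, `weights.bin` would land in the first list and `readme.txt` in the second.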

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Wauplin Wauplin changed the title feat: cross-repo LFS file copies in CommitOperationCopy [Draft] Cross-repo Xet file copies in CommitOperationCopy May 7, 2026
@Wauplin Wauplin changed the title [Draft] Cross-repo Xet file copies in CommitOperationCopy [draft] Cross-repo Xet file copies in CommitOperationCopy May 7, 2026
bot-ci-comment Bot commented May 7, 2026

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread src/huggingface_hub/hf_api.py Outdated
Wauplin commented May 7, 2026

Collaborator Author

Basic example works on local setup with https://github.com/huggingface-internal/moon-landing/pull/18121 🎉 (upload txt + Xet files to a model, copy them to a dataset):

=== Create repos ===
=== Upload to model repo ===
=== Copy model -> dataset (cross-repo) ===
  Copied model repo files to dataset repo under 'from_model/'
=== Verify: list files in dataset repo ===
=== Verify: download and compare content ===
=== ALL CHECKS PASSED ===
Test script
"""Test script for PR #4203: cross-repo LFS file copies.

Creates a model repo (with a txt file + LFS binary) and a dataset repo.
Copies files into the dataset repo, then verifies listing and content integrity.
"""

import os
import tempfile

from huggingface_hub import HfApi


api = HfApi(
    # endpoint="https://feat-cross-repo-lfs-78.us.dev.moon.huggingface.tech",
    # token="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",  # tmp token from GH ephemeral
    endpoint="http://localhost:5564",
    token="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",  # tmp local token
)
USERNAME = api.whoami()["name"]

MODEL_REPO = f"{USERNAME}/test-cross-repo-lfs-copy-model"
DATASET_REPO = f"{USERNAME}/test-cross-repo-lfs-copy-dataset"

# --- Generate random test data ---
txt_content = b"Hello, this is a plain text file for cross-repo copy testing!\n"
lfs_content = os.urandom(1024 * 1024)  # 1 MB random binary -> stored as LFS

# --- Cleanup: delete then recreate everything ---
print("=== Cleanup ===")
for repo_id, repo_type in [(MODEL_REPO, "model"), (DATASET_REPO, "dataset")]:
    try:
        api.delete_repo(repo_id, repo_type=repo_type)
        print(f"  Deleted {repo_type} repo '{repo_id}'")
    except Exception:
        pass

print("\n=== Create repos ===")
api.create_repo(MODEL_REPO, repo_type="model", exist_ok=True)
print(f"  Created model repo '{MODEL_REPO}'")
api.create_repo(DATASET_REPO, repo_type="dataset", exist_ok=True)
print(f"  Created dataset repo '{DATASET_REPO}'")

# --- Upload files to model repo ---
print("\n=== Upload to model repo ===")
api.upload_file(path_or_fileobj=txt_content, path_in_repo="readme.txt", repo_id=MODEL_REPO)
print("  Uploaded readme.txt (regular file)")
api.upload_file(path_or_fileobj=lfs_content, path_in_repo="weights.bin", repo_id=MODEL_REPO)
print("  Uploaded weights.bin (LFS file)")

# --- Copy model repo files to dataset (cross-repo copy!) ---
print("\n=== Copy model -> dataset (cross-repo) ===")
api.copy_files(
    f"hf://{MODEL_REPO}/",
    f"hf://datasets/{DATASET_REPO}/from_model/",
)
print("  Copied model repo files to dataset repo under 'from_model/'")

# --- Verify: list all files in dataset repo ---
print("\n=== Verify: list files in dataset repo ===")
files = list(api.list_repo_tree(DATASET_REPO, repo_type="dataset", recursive=True))
file_paths = sorted(f.path for f in files if hasattr(f, "size") and not f.path.endswith(".gitattributes"))
print(f"  Files found: {file_paths}")
expected = sorted(["from_model/readme.txt", "from_model/weights.bin"])
assert file_paths == expected, f"Expected {expected}, got {file_paths}"
print("  OK - file listing matches!")

# --- Verify: download and compare content ---
print("\n=== Verify: download and compare content ===")
with tempfile.TemporaryDirectory() as tmpdir:
    # 1. readme.txt (regular file from model)
    path = api.hf_hub_download(DATASET_REPO, "from_model/readme.txt", repo_type="dataset", local_dir=tmpdir)
    downloaded = open(path, "rb").read()
    assert downloaded == txt_content, f"readme.txt mismatch: {len(downloaded)} vs {len(txt_content)} bytes"
    print("  OK - from_model/readme.txt content matches")

    # 2. weights.bin (LFS file from model - the main cross-repo LFS copy test!)
    path = api.hf_hub_download(DATASET_REPO, "from_model/weights.bin", repo_type="dataset", local_dir=tmpdir)
    downloaded = open(path, "rb").read()
    assert downloaded == lfs_content, f"weights.bin mismatch: {len(downloaded)} vs {len(lfs_content)} bytes"
    print("  OK - from_model/weights.bin content matches (cross-repo LFS copy works!)")

print("\n=== ALL CHECKS PASSED ===")
Script output
=== Create repos ===
  Created model repo 'julien-c/test-cross-repo-lfs-copy-model'
  Created dataset repo 'julien-c/test-cross-repo-lfs-copy-dataset'

=== Upload to model repo ===
  Uploaded readme.txt (regular file)
Processing Files (1 / 1)      : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.05MB / 1.05MB,   680B/s  
New Data Upload               : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.05MB / 1.05MB,   680B/s  
                              : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.05MB / 1.05MB            
  Uploaded weights.bin (LFS file)

=== Copy model -> dataset (cross-repo) ===
  Copied model repo files to dataset repo under 'from_model/'

=== Verify: list files in dataset repo ===
  Files found: ['from_model/readme.txt', 'from_model/weights.bin']
  OK - file listing matches!

=== Verify: download and compare content ===
readme.txt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 62.0/62.0 [00:00<00:00, 284kB/s]
  OK - from_model/readme.txt content matches
from_model/weights.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.05M/1.05M [00:00<00:00, 2.56MB/s]
  OK - from_model/weights.bin content matches (cross-repo LFS copy works!)

=== ALL CHECKS PASSED ===

Comment thread src/huggingface_hub/hf_api.py Outdated
Comment thread src/huggingface_hub/_commit_api.py
Comment thread src/huggingface_hub/hf_api.py Outdated
Comment thread src/huggingface_hub/hf_api.py Outdated
Comment thread src/huggingface_hub/hf_api.py
Comment thread src/huggingface_hub/hf_api.py
Comment thread src/huggingface_hub/hf_api.py Outdated
@Wauplin Wauplin changed the title [draft] Cross-repo Xet file copies in CommitOperationCopy Cross-repo Xet file copies in CommitOperationCopy May 21, 2026
@Wauplin Wauplin changed the title Cross-repo Xet file copies in CommitOperationCopy Cross-repo Xet file copies May 21, 2026
@Wauplin Wauplin changed the title Cross-repo Xet file copies Cross-repo file copies (repo -> repo) May 21, 2026
- Update CommitOperationCopy docstring: document cross-repo copy, add examples
- Add unit tests for CommitOperationCopy validation (src_repo_id/src_repo_type)
- Add parametrized unit tests for _resolve_copy_target_path
- Add integration test for cross-repo copy via create_commit (test_hf_api.py)
- Add repo-to-repo copy_files integration tests (test_buckets.py)
- Update upload guide: document new src_repo_id/src_repo_type args
- Update buckets guide: rename section, add repo-to-repo examples

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Wauplin Wauplin changed the title Cross-repo file copies (repo -> repo) [Copy] Support cross-repo file copies May 21, 2026
Wauplin and others added 2 commits May 22, 2026 12:38
…to_copy

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

  commit_payload = _prepare_commit_payload(
-     operations=operations,
+     operations=operations_without_no_op,

Collaborator Author

this is a fix unrelated to this PR

Comment thread src/huggingface_hub/hf_api.py
Comment thread src/huggingface_hub/hf_api.py
@Wauplin Wauplin marked this pull request as ready for review May 22, 2026 11:09
@Wauplin Wauplin requested a review from hanouticelina May 22, 2026 11:10
repo_type=destination.type,
revision=destination.revision,
operations=commit_ops,
commit_message=f"Copy files from {source.type}s/{source.id}",


Same-repo copy commit message references own repo

Low Severity

The commit message in _copy_to_repo always uses f"Copy files from {source.type}s/{source.id}", even for same-repo copies (where is_same_repo is True). This produces misleading commit messages like "Copy files from models/user/my-model" when copying within the same repo, which is confusing when viewing commit history.


Reviewed by Cursor Bugbot for commit a6e7df0. Configure here.

Collaborator Author

not an issue

@hanouticelina hanouticelina left a comment

Collaborator

Thank you! my main comment is #4203 (comment). my other comments are mostly nits

Comment thread src/huggingface_hub/hf_api.py Outdated
seen_oids: set[str] = set()

for paths_batch in chunk_iterable(src_paths, 500):
src_repo_files = self.get_paths_info(

Collaborator

_fetch_files_to_copy already called get_paths_info, maybe we can pass the output of _fetch_files_to_copy into _duplicate_lfs_files and iterate the lfs entries from there instead of refetching?
basically in

files_to_copy = _fetch_files_to_copy(
copies=copies,

we can do:

files_to_copy = _fetch_files_to_copy(...)
self._duplicate_lfs_files(
    repo_id=repo_id,
    copies=copies,
    files_to_copy=files_to_copy,
    token=token,
    repo_type=repo_type,
)
to avoid making the HTTP POST calls twice

Collaborator Author

good catch, addressed in 393fa63

Comment thread src/huggingface_hub/hf_api.py Outdated
Comment thread src/huggingface_hub/hf_api.py
Comment thread tests/test_commit_api.py Outdated
Comment on lines +151 to +162
class TestCommitOperationCopy(unittest.TestCase):
    def test_cross_repo_copy_missing_repo_id_or_type(self):
        with pytest.raises(ValueError, match="`src_repo_type` is required when `src_repo_id` is set"):
            CommitOperationCopy(src_path_in_repo="src.bin", path_in_repo="dst.bin", src_repo_id="user/source")

        with pytest.raises(ValueError, match="`src_repo_id` is required when `src_repo_type` is set"):
            CommitOperationCopy(src_path_in_repo="src.bin", path_in_repo="dst.bin", src_repo_type="model")

    def test_path_normalization(self):
        op = CommitOperationCopy(src_path_in_repo="./src.bin", path_in_repo="/dst.bin")
        assert op.src_path_in_repo == "src.bin"
        assert op.path_in_repo == "dst.bin"

Collaborator

I'd rather have plain pytest here

Collaborator Author

I've removed the unittest.TestCase inheritance (it was not used indeed) but kept the class for namespace purposes (I usually prefer not to add the class namespace but here it's consistent with the module). The tests themselves were already pure-pytest

e2f4f0a

Comment thread src/huggingface_hub/hf_api.py Outdated

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 6ba9530. Configure here.

Comment thread src/huggingface_hub/hf_api.py Outdated
@Wauplin Wauplin requested a review from hanouticelina May 27, 2026 14:47
Wauplin commented May 27, 2026

Collaborator Author

Thanks for the review! I addressed all the comments :)

@hanouticelina hanouticelina left a comment

Collaborator

this is great, thank you!

@Wauplin Wauplin merged commit bb09fa4 into main May 28, 2026
35 of 41 checks passed
@Wauplin Wauplin deleted the feat/cross-repo-lfs-copy branch May 28, 2026 09:57
@huggingface-hub-bot

Contributor

This PR has been shipped as part of the v1.17.0 release.

2 participants