[Copy] Support cross-repo file copies #4203
Add `src_repo_id` and `src_repo_type` to `CommitOperationCopy` to enable copying LFS files across repositories. For cross-repo LFS copies, the client fetches the xet hash from the source repo via `get_paths_info` and passes it to the server which handles the xet hash duplication. Regular (non-LFS) cross-repo files are downloaded and re-uploaded as before. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
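The split described above (server-side duplication for LFS files, download and re-upload for regular files) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the library's implementation: `SourceFile`, `plan_cross_repo_copy`, and the `xet_hash` field are hypothetical stand-ins for the metadata the client gets back from `get_paths_info`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceFile:
    """Hypothetical stand-in for one entry of source-repo file metadata."""

    path: str
    xet_hash: Optional[str] = None  # assumed to be set only for LFS/Xet-backed files


def plan_cross_repo_copy(files: list[SourceFile]) -> tuple[list[str], list[str]]:
    """Split files into (hashes to duplicate server-side, paths to download and re-upload)."""
    hashes_to_duplicate: list[str] = []
    paths_to_reupload: list[str] = []
    seen: set[str] = set()
    for f in files:
        if f.xet_hash is None:
            # Regular file: no server-side shortcut, its content goes through the client.
            paths_to_reupload.append(f.path)
        elif f.xet_hash not in seen:
            # LFS file: each distinct hash only needs to be sent once.
            seen.add(f.xet_hash)
            hashes_to_duplicate.append(f.xet_hash)
    return hashes_to_duplicate, paths_to_reupload
```

Deduplicating by hash in this sketch means two source paths that point at the same blob would trigger a single duplication request.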
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Basic example works on local setup with https://github.com/huggingface-internal/moon-landing/pull/18121 🎉 (upload txt + Xet files to a model, copy them to a dataset):
Test script
"""Test script for PR #4203: cross-repo LFS file copies.
Creates a model repo (with a txt file + LFS binary) and a dataset repo.
Copies files into the dataset repo, then verifies listing and content integrity.
"""
import os
import tempfile
from huggingface_hub import HfApi
api = HfApi(
# endpoint="https://feat-cross-repo-lfs-78.us.dev.moon.huggingface.tech",
# token="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", # tmp token from GH ephemeral
endpoint="http://localhost:5564",
token="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", # tmp local token
)
USERNAME = api.whoami()["name"]
MODEL_REPO = f"{USERNAME}/test-cross-repo-lfs-copy-model"
DATASET_REPO = f"{USERNAME}/test-cross-repo-lfs-copy-dataset"
# --- Generate random test data ---
txt_content = b"Hello, this is a plain text file for cross-repo copy testing!\n"
lfs_content = os.urandom(1024 * 1024) # 1 MB random binary -> stored as LFS
# --- Cleanup: delete then recreate everything ---
print("=== Cleanup ===")
for repo_id, repo_type in [(MODEL_REPO, "model"), (DATASET_REPO, "dataset")]:
    try:
        api.delete_repo(repo_id, repo_type=repo_type)
        print(f" Deleted {repo_type} repo '{repo_id}'")
    except Exception:
        pass
print("\n=== Create repos ===")
api.create_repo(MODEL_REPO, repo_type="model", exist_ok=True)
print(f" Created model repo '{MODEL_REPO}'")
api.create_repo(DATASET_REPO, repo_type="dataset", exist_ok=True)
print(f" Created dataset repo '{DATASET_REPO}'")
# --- Upload files to model repo ---
print("\n=== Upload to model repo ===")
api.upload_file(path_or_fileobj=txt_content, path_in_repo="readme.txt", repo_id=MODEL_REPO)
print(" Uploaded readme.txt (regular file)")
api.upload_file(path_or_fileobj=lfs_content, path_in_repo="weights.bin", repo_id=MODEL_REPO)
print(" Uploaded weights.bin (LFS file)")
# --- Copy model repo files to dataset (cross-repo copy!) ---
print("\n=== Copy model -> dataset (cross-repo) ===")
api.copy_files(
    f"hf://{MODEL_REPO}/",
    f"hf://datasets/{DATASET_REPO}/from_model/",
)
print(" Copied model repo files to dataset repo under 'from_model/'")
# --- Verify: list all files in dataset repo ---
print("\n=== Verify: list files in dataset repo ===")
files = list(api.list_repo_tree(DATASET_REPO, repo_type="dataset", recursive=True))
file_paths = sorted(f.path for f in files if hasattr(f, "size") and not f.path.endswith(".gitattributes"))
print(f" Files found: {file_paths}")
expected = sorted(["from_model/readme.txt", "from_model/weights.bin"])
assert file_paths == expected, f"Expected {expected}, got {file_paths}"
print(" OK - file listing matches!")
# --- Verify: download and compare content ---
print("\n=== Verify: download and compare content ===")
with tempfile.TemporaryDirectory() as tmpdir:
    # 1. readme.txt (regular file from model)
    path = api.hf_hub_download(DATASET_REPO, "from_model/readme.txt", repo_type="dataset", local_dir=tmpdir)
    with open(path, "rb") as f:
        downloaded = f.read()
    assert downloaded == txt_content, f"readme.txt mismatch: {len(downloaded)} vs {len(txt_content)} bytes"
    print(" OK - from_model/readme.txt content matches")
    # 2. weights.bin (LFS file from model - the main cross-repo LFS copy test!)
    path = api.hf_hub_download(DATASET_REPO, "from_model/weights.bin", repo_type="dataset", local_dir=tmpdir)
    with open(path, "rb") as f:
        downloaded = f.read()
    assert downloaded == lfs_content, f"weights.bin mismatch: {len(downloaded)} vs {len(lfs_content)} bytes"
    print(" OK - from_model/weights.bin content matches (cross-repo LFS copy works!)")
print("\n=== ALL CHECKS PASSED ===")
Script output
- Update CommitOperationCopy docstring: document cross-repo copy, add examples
- Add unit tests for CommitOperationCopy validation (src_repo_id/src_repo_type)
- Add parametrized unit tests for _resolve_copy_target_path
- Add integration test for cross-repo copy via create_commit (test_hf_api.py)
- Add repo-to-repo copy_files integration tests (test_buckets.py)
- Update upload guide: document new src_repo_id/src_repo_type args
- Update buckets guide: rename section, add repo-to-repo examples
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…to_copy Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  commit_payload = _prepare_commit_payload(
-     operations=operations,
+     operations=operations_without_no_op,
this is a fix unrelated to this PR
repo_type=destination.type,
revision=destination.revision,
operations=commit_ops,
commit_message=f"Copy files from {source.type}s/{source.id}",
Same-repo copy commit message references own repo
Low Severity
The commit message in _copy_to_repo always uses f"Copy files from {source.type}s/{source.id}", even for same-repo copies (where is_same_repo is True). This produces misleading commit messages like "Copy files from models/user/my-model" when copying within the same repo, which is confusing when viewing commit history.
Reviewed by Cursor Bugbot for commit a6e7df0. Configure here.
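One possible way to address the issue flagged above is to branch on `is_same_repo` when building the message. The helper name and the same-repo wording below are hypothetical and only illustrate the idea; this is not necessarily the fix applied in the PR.

```python
def copy_commit_message(source_type: str, source_id: str, is_same_repo: bool) -> str:
    """Hypothetical helper: pick a commit message depending on whether the copy crosses repos."""
    if is_same_repo:
        # Avoid naming the repo as its own source in the commit history.
        return "Copy files within repo"
    return f"Copy files from {source_type}s/{source_id}"
```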
Thank you! my main comment is #4203 (comment). my other comments are mostly nits
seen_oids: set[str] = set()

for paths_batch in chunk_iterable(src_paths, 500):
    src_repo_files = self.get_paths_info(
_fetch_files_to_copy already called get_paths_info, maybe we can pass the output of _fetch_files_to_copy into _duplicate_lfs_files and iterate the LFS entries from there instead of refetching?
basically in
huggingface_hub/src/huggingface_hub/hf_api.py
Lines 4949 to 4950 in 5828612
we can do:
files_to_copy = _fetch_files_to_copy(...)
self._duplicate_lfs_files(
    repo_id=repo_id,
    copies=copies,
    files_to_copy=files_to_copy,
    token=token,
    repo_type=repo_type,
)
to avoid the HTTP POST calls twice

class TestCommitOperationCopy(unittest.TestCase):
    def test_cross_repo_copy_missing_repo_id_or_type(self):
        with pytest.raises(ValueError, match="`src_repo_type` is required when `src_repo_id` is set"):
            CommitOperationCopy(src_path_in_repo="src.bin", path_in_repo="dst.bin", src_repo_id="user/source")

        with pytest.raises(ValueError, match="`src_repo_id` is required when `src_repo_type` is set"):
            CommitOperationCopy(src_path_in_repo="src.bin", path_in_repo="dst.bin", src_repo_type="model")

    def test_path_normalization(self):
        op = CommitOperationCopy(src_path_in_repo="./src.bin", path_in_repo="/dst.bin")
        assert op.src_path_in_repo == "src.bin"
        assert op.path_in_repo == "dst.bin"
I'd rather have plain pytest here
I've removed the unittest.TestCase inheritance (it was not used indeed) but kept the class for namespace purposes (I usually prefer not to add the class namespace but here it's consistent with the module). The tests themselves were already pure-pytest
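For readers without the diff at hand, the behaviour these tests exercise can be approximated with a small stand-in. `CopyOpSketch` and `_normalize` are hypothetical names, and the normalization rule is an assumption inferred from the test inputs; the real `CommitOperationCopy` may validate more than this.

```python
from dataclasses import dataclass
from typing import Optional


def _normalize(path: str) -> str:
    # Assumed rule: drop a leading "./" or "/" so paths are relative to the repo root.
    if path.startswith("./"):
        path = path[2:]
    return path.lstrip("/")


@dataclass
class CopyOpSketch:
    """Hypothetical stand-in for CommitOperationCopy, limited to the checks tested above."""

    src_path_in_repo: str
    path_in_repo: str
    src_repo_id: Optional[str] = None
    src_repo_type: Optional[str] = None

    def __post_init__(self) -> None:
        self.src_path_in_repo = _normalize(self.src_path_in_repo)
        self.path_in_repo = _normalize(self.path_in_repo)
        # The two cross-repo fields must be provided together.
        if self.src_repo_id is not None and self.src_repo_type is None:
            raise ValueError("`src_repo_type` is required when `src_repo_id` is set")
        if self.src_repo_type is not None and self.src_repo_id is None:
            raise ValueError("`src_repo_id` is required when `src_repo_type` is set")
```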
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 6ba9530. Configure here.
Co-authored-by: célina <hanouticelina@gmail.com>
Thanks for the review! I addressed all the comments :)
hanouticelina left a comment
this is great, thank you!
This PR has been shipped as part of the v1.17.0 release.


Summary
Closes #3874 (partially — bucket-to-repo is not in scope).
Adds support for copying files between repositories (model→model, model→dataset, etc.) via `CommitOperationCopy` and `copy_files`. This builds on the server-side `/lfs-files/duplicate` endpoint (https://github.com/huggingface-internal/moon-landing/pull/18121).

What changed
- `CommitOperationCopy` now accepts `src_repo_id` and `src_repo_type` for cross-repo copies. Both LFS and regular files are supported: LFS files are duplicated server-side, regular files are downloaded and re-uploaded as part of the commit.
- `copy_files` (`hf buckets cp`) now supports repo-to-repo in addition to bucket destinations. Internally refactored into `_copy_to_bucket` and `_copy_to_repo` paths.
- `_duplicate_lfs_files` on `HfApi` handles LFS object duplication before commit (called automatically by `create_commit`). Raises `FileDuplicationError` on failure.
- `_resolve_copy_target_path` extracted as a standalone function for path resolution (shared by bucket and repo copy paths).
- `_fetch_files_to_copy` unified to handle both intra-repo and cross-repo copies: it resolves file metadata from the source repo when `src_repo_id` is set.

Copy matrix
🤖 Generated with Claude Code
Note
Medium Risk
Changes commit and LFS duplication flows for cross-repo copies; impact is mitigated by tests and by still blocking bucket-to-repo copies.
Overview
This PR adds repo-to-repo copying via `copy_files` and `CommitOperationCopy`, alongside existing bucket copy paths.

`CommitOperationCopy` now accepts optional `src_repo_id` and `src_repo_type` for cross-repo sources. LFS blobs are duplicated on the Hub with a new `_duplicate_lfs_files` step (batched `/lfs-files/duplicate`) before `create_commit`; failures raise `FileDuplicationError`. Non-LFS files are still fetched from the source repo and committed as regular file payloads. `_fetch_files_to_copy` is keyed by a `_CopySource` tuple so metadata and downloads resolve against the correct source repo.

`copy_files` is split into `_copy_to_bucket` and `_copy_to_repo`; shared destination path rules live in `_resolve_copy_target_path` (single file, folder nesting, trailing-`/` rsync semantics). Bucket → repo remains explicitly unsupported.

Docs cover repo-to-repo usage in the upload and buckets guides, and `create_commit` documents the new copy fields.

Reviewed by Cursor Bugbot for commit a59e63a. Bugbot is set up for automated code reviews on this repo. Configure here.
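The destination path rules mentioned above (single file, folder nesting, trailing-`/` rsync semantics) might look roughly like the sketch below. The function name, its signature, and the exact rules are assumptions modelled on rsync's conventions and on the test script's `from_model/` result; they are not taken from `_resolve_copy_target_path` itself.

```python
import posixpath


def resolve_target_path(src_root: str, src_file: str, dest: str, src_is_file: bool) -> str:
    """Hypothetical sketch of rsync-like destination resolution (not the library's code)."""
    dest_dir = dest.strip("/")
    if src_is_file:
        # Single file: a trailing "/" (or an empty destination) means "copy into this folder".
        if dest == "" or dest.endswith("/"):
            return posixpath.join(dest_dir, posixpath.basename(src_file))
        return dest_dir
    root = src_root.strip("/")
    # Path of the file relative to the source folder (assumes src_file lives under root).
    rel = src_file[len(root):].lstrip("/") if root else src_file
    if src_root.endswith("/") or root == "":
        # Trailing "/" on the source (or the repo root): copy the folder's contents.
        return posixpath.join(dest_dir, rel)
    # No trailing "/": nest the source folder itself under the destination.
    return posixpath.join(dest_dir, posixpath.basename(root), rel)
```

Under these assumed rules, copying the repo root into `from_model/` maps `readme.txt` to `from_model/readme.txt`, which is consistent with the listing checked in the test script.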