Skip to content

[Fix] Make concurrent downloads safe even when file locking is broken#4306

Merged
Wauplin merged 2 commits into
mainfrom
robust-concurrent-downloads
Jun 5, 2026
Merged

[Fix] Make concurrent downloads safe even when file locking is broken#4306
Wauplin merged 2 commits into
mainfrom
robust-concurrent-downloads

Conversation

@Wauplin

@Wauplin Wauplin commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

TL;DR:

  • filelock is not bullet-proof on some filesystems (see issue)
  • let's continue to use it as "best-effort" to avoid duplicate downloads in parallel (as before)
  • BUT let's not make the file consistency depends on filelock i.e. we always download to a new tmp file instead of resuming from incomplete file

⚠️ Breaking change: resuming a failed download is no longer possible. However file resumability was already a niche use case only applicable when hf_xet is disabled.


Alternative to #4228 (thanks @vadiklyutiy for the report and investigation!) addressing the cache corruption reported there, also related to tox-dev/filelock#389 and huggingface/transformers#30859.

On Lustre/GPFS and some NFS mounts, flock(2) silently succeeds for every caller, so WeakFileLock is a no-op and concurrent hf_hub_download calls end up appending to the same shared <etag>.incomplete file. Worst case is silent cache poisoning: the first process to finish passes its consistency check and renames the file into the blob store, while a slower process still holds an open fd on the renamed inode and keeps appending to the now-live blob.

The bug can be reproduced on any machine (no Lustre needed) with the script below: it patches fcntl.flock to a silent no-op — exactly what those filesystems do — and races two hf_hub_download processes with worst-case timing. On main, it reliably poisons the cache: a corrupted blob (134,981,632 bytes instead of 125,162,496, wrong sha256) behind a valid pointer, which is the same symptom as the vllm serve crash described in #4228. With this PR, the same script ends with both processes succeeding and a bit-identical blob.

repro_4228.py
"""Repro for PR #4228: silent cache corruption when flock() is a no-op (Lustre/GPFS/NFS).

Simulates the broken-filesystem behavior locally by patching `fcntl.flock` to a silent
no-op (exactly what Lustre/GPFS do: return success for every caller without blocking).

Scenario reproduced (the "silent poisoning" variant seen with multi-process vllm serve):
  - Process A starts downloading a 125MB file to `<etag>.incomplete` (opened in append mode).
  - Mid-download, process B enters the same code path. The no-op lock lets it in. B opens the
    same `.incomplete` file in append mode, sees resume_size=current size, and requests the
    missing tail via a Range request. B is a bit slower than A (here: simulated with a delay
    before its first write — in real life: network jitter, slower node, GIL pause, ...).
  - A finishes first. Its consistency check passes (B hasn't written yet), so A renames
    `.incomplete` -> blob and creates the pointer symlink. The cache looks healthy.
  - B still holds an open fd on the *renamed* inode (rename doesn't invalidate open fds) and
    appends its tail bytes to the now-live blob. The blob silently becomes larger than
    expected with duplicated bytes, while a valid pointer exists -> every subsequent reader
    gets a corrupted file. B itself fails its consistency check, but the damage is done.

Run: HF_HUB_DISABLE_XET=1 .venv/bin/python repro_4228.py
"""

import hashlib
import os
import subprocess
import sys
import tempfile
import time
from pathlib import Path

REPO_ID = "openai-community/gpt2"
FILENAME = "64-8bits.tflite"
EXPECTED_SIZE = 125162496
EXPECTED_SHA256 = "c966da3b74697803352ca7c6f2f220e7090a557b619de9da0c6b34d89f7825c1"

WORKER_A = """
import sys
import fcntl
fcntl.flock = lambda *a, **k: None  # Lustre/GPFS behavior: silent success for every caller
from huggingface_hub import hf_hub_download
path = hf_hub_download(repo_id=sys.argv[1], filename=sys.argv[2])
print("worker A done:", path)
"""

# Worker B: same no-op flock, but (1) pre-warms imports + TLS connection, then waits for a
# signal file so it can enter the download loop while A is mid-flight, and (2) delays its
# first chunk write by a few seconds to emulate a slower sibling process.
WORKER_B = """
import os
import sys
import time
import fcntl
fcntl.flock = lambda *a, **k: None  # Lustre/GPFS behavior: silent success for every caller

import huggingface_hub.file_download as fd
from huggingface_hub import hf_hub_download, get_hf_file_metadata, hf_hub_url

repo_id, filename, signal_file = sys.argv[1], sys.argv[2], sys.argv[3]

# Pre-warm DNS/TLS + imports so the post-signal latency is minimal.
get_hf_file_metadata(hf_hub_url(repo_id=repo_id, filename=filename))

_orig_http_get = fd.http_get

class SlowFirstWrite:
    def __init__(self, f):
        self._f = f
        self._first = True
    def write(self, data):
        if self._first:
            time.sleep(6)  # let A finish, pass its consistency check, and rename the blob
            self._first = False
        return self._f.write(data)
    def __getattr__(self, name):
        return getattr(self._f, name)

def slow_http_get(url, temp_file, **kwargs):
    return _orig_http_get(url, SlowFirstWrite(temp_file), **kwargs)

fd.http_get = slow_http_get

while not os.path.exists(signal_file):
    time.sleep(0.01)

path = hf_hub_download(repo_id=repo_id, filename=filename)
print("worker B done:", path)
"""


def main() -> None:
    tmp = tempfile.mkdtemp(prefix="hf_repro_4228_")
    env = {**os.environ, "HF_HOME": tmp, "HF_HUB_DISABLE_XET": "1"}
    signal_file = os.path.join(tmp, "go_b")
    print(f"HF_HOME={tmp}")

    repo_dir = Path(tmp) / "hub" / "models--openai-community--gpt2"

    b = subprocess.Popen(
        [sys.executable, "-c", WORKER_B, REPO_ID, FILENAME, signal_file],
        env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )
    time.sleep(3)  # let B pre-warm
    a = subprocess.Popen(
        [sys.executable, "-c", WORKER_A, REPO_ID, FILENAME],
        env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )

    # Signal B once A's download is ~half way.
    deadline = time.time() + 300
    while time.time() < deadline:
        candidates = list(repo_dir.glob("blobs/*.incomplete")) if repo_dir.exists() else []
        if candidates and candidates[0].stat().st_size > 0.85 * EXPECTED_SIZE:
            print(f"incomplete at {candidates[0].stat().st_size / EXPECTED_SIZE:.0%} -> signaling worker B")
            Path(signal_file).touch()
            break
        if a.poll() is not None:
            print("worker A finished too early; rerun")
            print(a.stdout.read())
            b.kill()
            return
        time.sleep(0.02)

    out_a, _ = a.communicate()
    print(f"--- worker A (exit {a.returncode}) ---\n{out_a}")

    blob = repo_dir / "blobs" / EXPECTED_SHA256
    if blob.exists():
        print(f"blob right after A finished: size={blob.stat().st_size} (expected {EXPECTED_SIZE}) -> looks healthy")

    out_b, _ = b.communicate()
    print(f"--- worker B (exit {b.returncode}) ---\n{out_b}")

    pointer = next((repo_dir / "snapshots").rglob(FILENAME), None) if (repo_dir / "snapshots").exists() else None
    print(f"pointer: {pointer}")
    if blob.exists():
        actual_size = blob.stat().st_size
        actual_sha = hashlib.sha256(blob.read_bytes()).hexdigest()
        print(f"final blob size: {actual_size} (expected {EXPECTED_SIZE})")
        print(f"final blob sha256: {actual_sha}\n     expected sha256: {EXPECTED_SHA256}")
        if pointer is not None and (actual_size != EXPECTED_SIZE or actual_sha != EXPECTED_SHA256):
            print("\n*** SILENT CACHE POISONING: corrupted blob behind a valid pointer ***")
        elif actual_size == EXPECTED_SIZE and actual_sha == EXPECTED_SHA256:
            print("\nblob is intact (race not triggered this run, try again)")
    else:
        print(f"blob missing: {blob}")


if __name__ == "__main__":
    main()

Instead of detecting broken filesystems, this PR removes the lock from the correctness equation: each download now writes to a process-unique temporary file (<etag>.<random>.incomplete) and atomically renames it into place, and refs/<revision> is written atomically (tmp file + rename) so readers can never observe a truncated ref — that ref write happens outside any lock today and was racy even on local filesystems. With this change a broken lock costs only duplicated bandwidth, on any filesystem, including configurations that detection can't catch (FUSE wrappers, Lustre localflock, NFS local_lock=flock). The lock is kept as a best-effort way to avoid downloading the same file twice.

Warning

Breaking change: interrupted downloads are no longer resumed by a later call — each download writes to its own unique temporary file (in-process retry/resume in http_get is unaffected). Sharing a partial file across processes is exactly what made the corruption possible. Xet-backed downloads (the default for new repos) recover most of the resume value through chunk dedup.

🤖 Generated with Claude Code


Note

Medium Risk
Changes core download/cache write paths and drops cross-process resume from shared incomplete files, which can increase bandwidth on failure but fixes silent blob corruption on shared/cluster filesystems.

Overview
Makes Hub cache downloads safe when flock-based locks are ineffective (e.g. Lustre/GPFS/some NFS), where concurrent processes could corrupt a shared <etag>.incomplete blob.

_download_to_tmp_and_move no longer appends to or resumes a shared incomplete file. Each download writes to a process-unique *.{uuid}.incomplete path, then _chmod_and_move into the final blob; a finally block removes any leftover temp file. Cross-process resume from .incomplete is removed (in-process http_get retries are unchanged). _cache_commit_hash_for_specific_revision now writes refs via tmp + os.replace so readers never see a half-written ref.

WeakFileLock stays as best-effort deduplication only; comments document that correctness does not depend on it. Tests drop resume/force_download incomplete tests, add cleanup of *.incomplete, and adjust the legacy-complete-incomplete regression scenario.

Reviewed by Cursor Bugbot for commit 75cbbca. Bugbot is set up for automated code reviews on this repo. Configure here.

…ption

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bot-ci-comment

bot-ci-comment Bot commented Jun 3, 2026

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Wauplin Wauplin changed the title [Download] Make concurrent downloads safe even when file locking is broken [Fix] Make concurrent downloads safe even when file locking is broken Jun 3, 2026
@Wauplin Wauplin marked this pull request as ready for review June 3, 2026 15:31
@Wauplin Wauplin requested a review from hanouticelina June 3, 2026 15:31

@hanouticelina hanouticelina left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the solution looks good to me 👍

@Wauplin Wauplin merged commit c505f77 into main Jun 5, 2026
23 of 26 checks passed
@Wauplin Wauplin deleted the robust-concurrent-downloads branch June 5, 2026 08:58
@huggingface-hub-bot

Copy link
Copy Markdown
Contributor

This PR has been shipped as part of the v1.18.0 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants