[Bug] entire model gets deleted when a single file is corrupt #14754

@kmod

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

I started sglang on my newly downloaded model, and much to my surprise sglang deleted the entire model from disk. I believe this is an uncommon corner case introduced by #13729 (cc @alisonshao).

I am using sglang==0.5.6.post1, transformers==5.0.0rc0, huggingface-hub==1.2.1. I first used .venv/bin/hf download QuantTrio/DeepSeek-V3.2-AWQ, then I launched sglang with

.venv/bin/python -m sglang.launch_server --model QuantTrio/DeepSeek-V3.2-AWQ --served-model-name QuantTrio/DeepSeek-V3.2-AWQ --host localhost --port 8000 --mem-fraction-static 0.95 --sleep-on-idle --tp=4 --context-length 32768 --attention-backend flashinfer --chunked-prefill-size 8192 --enable-mixed-chunk --cuda-graph-max-bs 1 --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'

This produced the following log:

...
[2025-12-09 14:26:48 TP0] Init torch distributed ends. mem usage=0.28 GB
[2025-12-09 14:26:48 TP2] Init torch distributed ends. mem usage=0.28 GB
[2025-12-09 14:26:48 TP1] Init torch distributed ends. mem usage=0.28 GB
[2025-12-09 14:26:48 TP3] Init torch distributed ends. mem usage=0.28 GB
[2025-12-09 14:26:49 TP1] Ignore import error when loading sglang.srt.models.mindspore: name 'ms' is not defined
[2025-12-09 14:26:49 TP3] Ignore import error when loading sglang.srt.models.mindspore: name 'ms' is not defined
[2025-12-09 14:26:49 TP0] Ignore import error when loading sglang.srt.models.mindspore: name 'ms' is not defined
[2025-12-09 14:26:49 TP2] Ignore import error when loading sglang.srt.models.mindspore: name 'ms' is not defined
[2025-12-09 14:26:49 TP1] Load weight begin. avail mem=94.10 GB
[2025-12-09 14:26:49 TP0] Load weight begin. avail mem=94.10 GB
[2025-12-09 14:26:49 TP3] Load weight begin. avail mem=94.10 GB
[2025-12-09 14:26:49 TP2] Load weight begin. avail mem=94.10 GB
[2025-12-09 14:26:50 TP0] Shared experts fusion optimization enabled.
[2025-12-09 14:26:50 TP0] Corrupted safetensors file detected: /home/kmod/.cache/huggingface/hub/models--QuantTrio--DeepSeek-V3.2-AWQ/snapshots/340023cb6036c97c5c664ac944300e9d2b1a3f2e/model-00008-of-00121.safetensors - SafetensorError: Error while deserializing header: invalid JSON in header: EOF while parsing a value at line 1 column 0
[2025-12-09 14:26:50 TP0] Found 1 corrupted file(s) for QuantTrio/DeepSeek-V3.2-AWQ: Corrupted shard files: ['model-00008-of-00121.safetensors']. Will selectively clean and re-download only these files.
[2025-12-09 14:26:50 TP0] Removed corrupted symlink: model-00008-of-00121.safetensors
[2025-12-09 14:26:50 TP0] Removed corrupted blob: 071a3348289365c723f283ce58c412d17b5b46184e03c653cf2032465d7aa31b
[2025-12-09 14:26:50 TP0] Removed 1 corrupted file(s) for QuantTrio/DeepSeek-V3.2-AWQ. These will be re-downloaded on next load.
[2025-12-09 14:26:50 TP0] HTTP Request: GET https://huggingface.co/api/models/QuantTrio/DeepSeek-V3.2-AWQ "HTTP/1.1 200 OK"
[2025-12-09 14:26:50 TP2] Removing entire cache for QuantTrio/DeepSeek-V3.2-AWQ at /home/kmod/.cache/huggingface/hub/models--QuantTrio--DeepSeek-V3.2-AWQ. Reason: Missing 1 file(s) from index model.safetensors.index.json: ['model-00008-of-00121.safetensors']
[2025-12-09 14:26:50 TP0] HTTP Request: GET https://huggingface.co/api/models/QuantTrio/DeepSeek-V3.2-AWQ/tree/main?recursive=false&expand=false "HTTP/1.1 200 OK"
[2025-12-09 14:26:50 TP0] Using model weights format ['*.safetensors']
[2025-12-09 14:26:50 TP0] HTTP Request: GET https://huggingface.co/api/models/QuantTrio/DeepSeek-V3.2-AWQ/revision/main "HTTP/1.1 200 OK"
[2025-12-09 14:26:50 TP0] HTTP Request: HEAD https://huggingface.co/QuantTrio/DeepSeek-V3.2-AWQ/resolve/340023cb6036c97c5c664ac944300e9d2b1a3f2e/model-00008-of-00121.safetensors "HTTP/1.1 302 Found"
[2025-12-09 14:26:50 TP0] HTTP Request: HEAD https://huggingface.co/QuantTrio/DeepSeek-V3.2-AWQ/resolve/340023cb6036c97c5c664ac944300e9d2b1a3f2e/model-00041-of-00121.safetensors "HTTP/1.1 302 Found"
[2025-12-09 14:26:51 TP0] HTTP Request: GET https://huggingface.co/api/models/QuantTrio/DeepSeek-V3.2-AWQ/xet-read-token/340023cb6036c97c5c664ac944300e9d2b1a3f2e "HTTP/1.1 200 OK"
[2025-12-09 14:26:51 TP0] HTTP Request: GET https://huggingface.co/api/models/QuantTrio/DeepSeek-V3.2-AWQ/xet-read-token/340023cb6036c97c5c664ac944300e9d2b1a3f2e "HTTP/1.1 200 OK"
[2025-12-09 14:27:07 TP2] Failed to remove corrupted cache directory /home/kmod/.cache/huggingface/hub/models--QuantTrio--DeepSeek-V3.2-AWQ: [Errno 39] Directory not empty: 'blobs'. Manual cleanup may be required.
[2025-12-09 14:27:07 TP3] Removing entire cache for QuantTrio/DeepSeek-V3.2-AWQ at /home/kmod/.cache/huggingface/hub/models--QuantTrio--DeepSeek-V3.2-AWQ. Reason: Incomplete download detected (2 incomplete files)
[2025-12-09 14:27:07 TP3] Successfully removed corrupted cache directory
[2025-12-09 14:27:08 TP1] HTTP Request: GET https://huggingface.co/api/models/QuantTrio/DeepSeek-V3.2-AWQ "HTTP/1.1 200 OK"
[2025-12-09 14:27:08 TP3] HTTP Request: GET https://huggingface.co/api/models/QuantTrio/DeepSeek-V3.2-AWQ "HTTP/1.1 200 OK"
[2025-12-09 14:27:08 TP2] HTTP Request: GET https://huggingface.co/api/models/QuantTrio/DeepSeek-V3.2-AWQ "HTTP/1.1 200 OK"
[2025-12-09 14:27:08 TP2] HTTP Request: GET https://huggingface.co/api/models/QuantTrio/DeepSeek-V3.2-AWQ/tree/main?recursive=false&expand=false "HTTP/1.1 200 OK"
[2025-12-09 14:27:08 TP3] HTTP Request: GET https://huggingface.co/api/models/QuantTrio/DeepSeek-V3.2-AWQ/tree/main?recursive=false&expand=false "HTTP/1.1 200 OK"
[2025-12-09 14:27:09 TP1] HTTP Request: GET https://huggingface.co/api/models/QuantTrio/DeepSeek-V3.2-AWQ/tree/main?recursive=false&expand=false "HTTP/1.1 200 OK"
[2025-12-09 14:27:20 TP0] Scheduler hit an exception: Traceback (most recent call last):
  [snip]
  File "/home/kmod/ai/.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py", line 429, in _inner_hf_hub_download
    hf_hub_download(  # type: ignore
  File "/home/kmod/ai/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 89, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/kmod/ai/.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1024, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kmod/ai/.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1240, in _hf_hub_download_to_cache_dir
    _download_to_tmp_and_move(
  File "/home/kmod/ai/.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1864, in _download_to_tmp_and_move
    xet_get(
  File "/home/kmod/ai/.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 588, in xet_get
    download_files(
RuntimeError: Data processing error: CAS service error : IO Error: No such file or directory (os error 2)

One thing to note is that this safetensors file is 3 GiB and my internet connection is 50 MiB/s, so the file cannot possibly have been fully re-downloaded in the 16 seconds that elapsed. I suspect the 16 seconds mostly reflects the time it took to delete the ~360 GiB of weights.

It looks like one of the blobs was corrupted, and sglang intended to delete and re-download only that blob. But this interacts poorly, to put it mildly, with the "delete the entire model if any files are missing" logic. Even though the missing-file check runs before the corrupted-file cleanup, multiple workers execute both checks. The validation logic and the download logic each appear to be synchronized on their own, but they use separate lock invocations, and even separate locks.

It seems what happened was:

  • TP0 acquired the validation lock, saw that the file was corrupted, deleted it, released the validation lock, and started re-downloading the file
  • TP2 then acquired the validation lock, saw that a file was missing from the index, and deleted the whole directory
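The suspected interleaving can be made concrete with a small, deterministic simulation (all names here are hypothetical; the real TP workers are separate processes and the real locking mechanism differs, but the lock-scope problem is the same):

```python
import threading

validation_lock = threading.Lock()           # per-step lock, as suspected
cache = {"model-00008": "corrupted", "model-00009": "ok"}
events = []
tp0_released = threading.Event()

def worker_tp0():
    with validation_lock:
        # TP0's validation pass: remove just the corrupted shard.
        if cache.get("model-00008") == "corrupted":
            del cache["model-00008"]
            events.append("TP0: removed corrupted shard")
    tp0_released.set()
    # ...the slow re-download of model-00008 happens OUTSIDE the lock...

def worker_tp2():
    tp0_released.wait()   # force the problematic ordering deterministically
    with validation_lock:
        # TP2's validation pass: a shard is missing from the index -> wipe all.
        if "model-00008" not in cache:
            cache.clear()
            events.append("TP2: removed entire cache")

t0 = threading.Thread(target=worker_tp0)
t2 = threading.Thread(target=worker_tp2)
t0.start(); t2.start(); t0.join(); t2.join()
print(events)
```

Because the window between "shard deleted" and "shard re-downloaded" is not covered by any lock, TP2 legitimately observes a missing file and escalates to a full wipe.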

My guess is that the easiest and safest fix would be to expand the lock inside download_weights_from_hf() to cover the entire function.
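A sketch of what that might look like (the signature and lock mechanism here are hypothetical, not the actual sglang code; since TP ranks are separate processes, the lock would need to be cross-process, e.g. an advisory file lock, rather than a threading.Lock):

```python
import fcntl
import os

def download_weights_from_hf(cache_dir, validate, redownload):
    """Hold ONE exclusive lock across both validation and re-download."""
    lock_path = os.path.join(cache_dir, ".weights.lock")
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)   # blocks the other TP ranks
        try:
            # Validation may delete corrupted shards; no other rank can
            # observe the "shard missing" state before it is restored.
            missing = validate(cache_dir)
            if missing:
                redownload(missing)             # re-fetch under the SAME lock
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

With one lock held for the whole sequence, the intermediate "deleted but not yet re-downloaded" state is never visible to the other ranks' validation passes.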


Also, for what it's worth, automatically calling shutil.rmtree (i.e. rm -rf) on a user's machine seems excessively dangerous, especially when the data was provided by the user. In this case it was a public model that could be re-downloaded, but this feels worryingly close to deleting a custom local model. And the
rm -rf $DIR/../..
pattern in _cleanup_corrupted_model_cache() is only one directory-layout change away from deleting a much larger portion of the user's filesystem, as in the classic horror stories.
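If automatic deletion is kept at all, a defensive guard would help. This is a sketch with hypothetical names: it refuses to rmtree any path that does not look like a HF hub model directory (a models--* directory directly under hub/), so a layout change fails loudly instead of deleting a parent directory:

```python
import os
import shutil

def safe_remove_model_cache(path):
    """Delete a model cache directory, but only if it looks like one."""
    resolved = os.path.realpath(path)
    parent = os.path.basename(os.path.dirname(resolved))
    # Expect .../hub/models--{org}--{name}; anything else is suspicious.
    if parent != "hub" or not os.path.basename(resolved).startswith("models--"):
        raise ValueError(f"refusing to delete suspicious path: {resolved}")
    shutil.rmtree(resolved)
```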

I think it would be preferable to simply alert the user that the cache is unrecoverable and recommend that they delete it themselves. This code path should execute only extremely rarely, which means both that it wouldn't place much burden on users and that it will never get the battle-testing I think is necessary for an automated destructive action such as this.
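A minimal sketch of that non-destructive behavior (function name and error text are made up for illustration): instead of touching the cache, fail with an actionable message.

```python
def report_corrupted_cache(cache_dir, bad_files):
    """Fail loudly instead of deleting anything automatically."""
    raise RuntimeError(
        f"Corrupted weight files detected in {cache_dir}: {bad_files}. "
        "Refusing to modify the cache automatically. Please verify the "
        "files, or delete the directory and re-download the model."
    )
```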

Reproduction

.venv/bin/hf download QuantTrio/DeepSeek-V3.2-AWQ

# simulated data corruption:
dd if=/dev/urandom conv=notrunc of=~/.cache/huggingface/hub/models--QuantTrio--DeepSeek-V3.2-AWQ/snapshots/340023cb6036c97c5c664ac944300e9d2b1a3f2e/model-00008-of-00121.safetensors bs=1M count=16

.venv/bin/python -m sglang.launch_server --model QuantTrio/DeepSeek-V3.2-AWQ --served-model-name QuantTrio/DeepSeek-V3.2-AWQ --host localhost --port 8000 --mem-fraction-static 0.95 --sleep-on-idle --tp=4 --context-length 32768 --attention-backend flashinfer --chunked-prefill-size 8192 --enable-mixed-chunk --cuda-graph-max-bs 1 --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
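As a sanity check that the dd step really corrupts the shard (stdlib-only, not sglang code): a .safetensors file begins with an 8-byte little-endian header length followed by that many bytes of JSON, so a quick parse attempt distinguishes an intact header from an overwritten one, matching the "invalid JSON in header" error in the log above.

```python
import json
import struct

def safetensors_header_ok(path):
    """True if the file's safetensors header length and JSON both parse."""
    try:
        with open(path, "rb") as f:
            (header_len,) = struct.unpack("<Q", f.read(8))
            json.loads(f.read(header_len))
        return True
    except Exception:
        return False
```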

Environment

Python: 3.12.3 (main, Nov  6 2025, 13:44:16) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
GPU 0,1,2,3 Compute Capability: 12.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.8, V12.8.93
CUDA Driver Version: 580.95.05
PyTorch: 2.9.1+cu128
sglang: 0.5.6.post1
sgl_kernel: 0.3.19
flashinfer_python: 0.5.3
flashinfer_cubin: 0.5.3
flashinfer_jit_cache: Module Not Found
triton: 3.5.1
transformers: 5.0.0rc0
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.13.2
fastapi: 0.123.9
hf_transfer: 0.1.9
huggingface_hub: 1.2.1
interegular: 0.3.3
modelscope: 1.32.0
orjson: 3.11.4
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.3
pydantic: 2.12.5
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: 0.12.0
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.71.0
litellm: Module Not Found
decord2: 2.0.0
NVIDIA Topology: 
	GPU0	GPU1	GPU2	GPU3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NODE	NODE	NODE	0-127	0		N/A
GPU1	NODE	 X 	NODE	NODE	0-127	0		N/A
GPU2	NODE	NODE	 X 	NODE	0-127	0		N/A
GPU3	NODE	NODE	NODE	 X 	0-127	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1024

Labels

bug (Something isn't working)