[data.llm] Skip safetensor file downloads for runai streamer mode#55662
kouroshHakha merged 4 commits into ray-project:master
Conversation
Code Review
This pull request introduces a mechanism to skip downloading .safetensors files when using the runai_streamer mode. The changes are cleanly implemented: a runai_streamer flag is propagated through the model-downloading functions, and a new suffixes_to_exclude parameter on download_files filters out files by suffix, which is a clean way to implement the core logic. My feedback includes a minor suggestion to improve maintainability.
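The suffix-based filtering described in the review could look something like the following minimal sketch. The function and parameter names here are illustrative, not the exact ones in the PR:

```python
from typing import List, Optional


def filter_files(
    file_paths: List[str],
    suffixes_to_exclude: Optional[List[str]] = None,
) -> List[str]:
    """Drop any file whose name ends with one of the excluded suffixes."""
    if not suffixes_to_exclude:
        return file_paths
    return [
        path
        for path in file_paths
        if not any(path.endswith(suffix) for suffix in suffixes_to_exclude)
    ]


files = ["model.safetensors", "config.json", "tokenizer.json"]
print(filter_files(files, suffixes_to_exclude=[".safetensors"]))
# ['config.json', 'tokenizer.json']
```

Passing None (the default) leaves the file list untouched, so callers that do not care about exclusion are unaffected.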
```python
    ["tokenizer", "config.json"] if tokenizer_only else []
)

safetensors_to_exclude = [".safetensors"] if runai_streamer else None
```
kouroshHakha left a comment
Hi @jiangwu300 ,
Thanks for your contribution. Left some feedback.
```python
destination_path: Path where the model will be stored
bucket_uri: URI of the cloud directory containing the model
tokenizer_only: If True, only download tokenizer-related files
runai_streamer: If True, skip download of safetensor files
```
Shouldn't this use case already be covered by using tokenizer_only=True?
```python
if self.max_pending_requests > 0:
    logger.info("Max pending requests is set to %d", self.max_pending_requests)

runai_streamer = self.engine_kwargs.get("load_format") == "runai_streamer"
```
We should simply do download_model=....TokenizerOnly when load_format is one of the formats for which model downloading should be skipped. There are also other load formats that should be treated the same way, like Tensorizer. Is there a better condition to use here?
I don't think using tokenizer only would suffice (please correct me if I'm wrong) because certain models require specific files to exist in the model directory like Deepseek R1 (attached screenshot in comment above).
Gotcha, ok so would it make sense to define a new download_model enum value called EXCLUDE_SAFETENSOR that will basically exclude *.safetensors? We should make the API more generic than runai_streamer=True/False.
That makes sense. I made this change.
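The enum-based API discussed above could be sketched roughly as follows. The enum name and values are hypothetical, for illustration only, and are not the actual Ray code:

```python
from enum import Enum
from typing import List, Optional


class DownloadModel(Enum):
    """Hypothetical download modes, for illustration only."""

    FULL = "full"                              # download every file
    TOKENIZER_ONLY = "tokenizer_only"          # tokenizer + config.json only
    EXCLUDE_SAFETENSOR = "exclude_safetensor"  # skip *.safetensors files


def suffixes_to_exclude(mode: DownloadModel) -> Optional[List[str]]:
    # Only the EXCLUDE_SAFETENSOR mode filters by suffix; the other
    # modes return None so nothing is excluded.
    if mode is DownloadModel.EXCLUDE_SAFETENSOR:
        return [".safetensors"]
    return None


print(suffixes_to_exclude(DownloadModel.EXCLUDE_SAFETENSOR))
# ['.safetensors']
```

An enum keeps the API open to further modes (e.g. excluding other weight formats) without adding another boolean flag per load format.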
nrghosh left a comment
Thanks for the contribution!
Quick fixes for CI:
- Please sign off all commits (e.g. `git commit -s -m "foobar"`); this will fix the DCO check failure.
- Please lint for code format (see `./ci/lint/lint.sh code_format` and `./ci/lint/lint.sh pre_commit`); this will fix the microcheck failure.
@nrghosh @kouroshHakha Are either of you able to escalate the vLLM issue (vllm-project/vllm#22843 (comment)) to the vLLM team? This fix still requires a fix on the vLLM front.

Attached: runai_log.txt
Signed-off-by: Jiang Wu <jwu@cclgroup.com>
Force-pushed from f5f270b to 143422e
kouroshHakha left a comment
Running tests and release tests. Just one quick comment to address in parallel:
```python
if self.max_pending_requests > 0:
    logger.info("Max pending requests is set to %d", self.max_pending_requests)

exclude_safetensors = self.engine_kwargs.get("load_format") == "runai_streamer"
```
This should be something like `.get("load_format", "auto") not in ["safetensors", "auto"]`?
I'm actually not sure which load formats should exclude safetensor downloads except that runai_streamer needs to skip it for sure.
I think we can revise the list. Let's make it inclusive of runai_streamer and tensorizer (which is similar to runai_streamer). I know about these two for sure.
Added tensorizer
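The condition settled on in this thread can be sketched as a membership check against a set of streaming load formats. The variable and function names below are illustrative, not the PR's exact identifiers:

```python
# Load formats that stream weights directly from remote storage and
# therefore do not need *.safetensors files downloaded to local disk.
STREAMING_LOAD_FORMATS = {"runai_streamer", "tensorizer"}


def should_exclude_safetensors(engine_kwargs: dict) -> bool:
    # Missing "load_format" means the default loader, which does need
    # the weight files on disk, so we return False in that case.
    return engine_kwargs.get("load_format") in STREAMING_LOAD_FORMATS


print(should_exclude_safetensors({"load_format": "runai_streamer"}))  # True
print(should_exclude_safetensors({"load_format": "auto"}))            # False
```

A set makes it a one-line change to add further formats later if more streaming loaders turn up.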
Signed-off-by: Jiang Wu <jwu@cclgroup.com>
…y-project#55662) Signed-off-by: Jiang Wu <jwu@cclgroup.com> Signed-off-by: Jiang Wu <JWu@cclgroup.com> Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
…5662) Signed-off-by: Jiang Wu <jwu@cclgroup.com> Signed-off-by: Jiang Wu <JWu@cclgroup.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
…y-project#55662) Signed-off-by: Jiang Wu <jwu@cclgroup.com> Signed-off-by: Jiang Wu <JWu@cclgroup.com>

Why are these changes needed?
Ray LLM downloads models to disk even when the runai streamer is enabled, which causes significant startup overhead and network costs when downloading models at scale. This change checks "load_format" in the engine kwargs and skips the download of the safetensors files when the runai streamer is on.
NOTE: vLLM currently has a bug when loading models with the runai streamer through the LLM and AsyncLLMEngine classes, because it skips the download of the tokenizer and config.json. This PR needs to land in tandem with the vLLM issue fix for the functionality to work as expected.
vLLM Issue: vllm-project/vllm#22843
Related issue number
Closes #55574
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.