[data.llm] Skip safetensor file downloads for runai streamer mode#55662

Merged
kouroshHakha merged 4 commits into ray-project:master from jiangwu300:runai_fix_pr
Aug 24, 2025
Conversation

@jiangwu300
Contributor

@jiangwu300 jiangwu300 commented Aug 15, 2025

Why are these changes needed?

Ray LLM downloads models to disk even when the RunAI streamer is turned on, which causes significant startup overhead and network cost when loading models at scale. This change first checks "load_format" in the engine kwargs and skips downloading the safetensors files when the RunAI streamer is enabled.

NOTE: vLLM currently has a bug when loading models with the RunAI streamer through the LLM and AsyncLLMEngine classes, because it skips downloading the tokenizer and config.json. This PR needs to land in tandem with the vLLM fix for the functionality to work as expected.
vLLM Issue: vllm-project/vllm#22843
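
The behavior described above can be sketched as follows. This is a minimal illustration, not Ray's actual internals; the function names and the plain-dict `engine_kwargs` are assumptions for the example:

```python
# Sketch: skip staging safetensor files on local disk when the RunAI
# streamer will stream them directly from object storage instead.
# (`_should_skip_safetensors` and `files_to_download` are illustrative names.)

def _should_skip_safetensors(engine_kwargs: dict) -> bool:
    # The RunAI streamer reads the weights itself, so pre-downloading
    # the .safetensors files would be wasted time and bandwidth.
    return engine_kwargs.get("load_format") == "runai_streamer"

def files_to_download(all_files: list, engine_kwargs: dict) -> list:
    if _should_skip_safetensors(engine_kwargs):
        return [f for f in all_files if not f.endswith(".safetensors")]
    return list(all_files)

files = ["config.json", "tokenizer.json", "model-00001-of-00002.safetensors"]
print(files_to_download(files, {"load_format": "runai_streamer"}))
```

Everything except the weight shards (config, tokenizer, index files) is still downloaded, which matters for models whose loaders expect those files on disk.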

Related issue number

Closes #55574

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jiangwu300 jiangwu300 requested a review from a team as a code owner August 15, 2025 18:52
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a mechanism to skip downloading .safetensors files when using the runai_streamer mode. The changes are well-implemented by adding a runai_streamer flag that is propagated through the model downloading functions. A new suffixes_to_exclude parameter is added to download_files to filter out files based on their suffix, which is a clean way to implement the core logic. My feedback includes a minor suggestion to improve maintainability.

["tokenizer", "config.json"] if tokenizer_only else []
)

safetensors_to_exclude = [".safetensors"] if runai_streamer else None
Contributor


medium

To improve maintainability and avoid using a magic string, consider defining ".safetensors" as a constant at the module level (e.g., SAFETENSORS_SUFFIX = ".safetensors") and referencing it here. This makes the code clearer and easier to update if the suffix ever needs to change.
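
An illustrative version of the suggested refactor (the variable names mirror the diff above, but this is a sketch, not the merged code):

```python
# Hoist the magic string into a module-level constant so the suffix is
# defined in one place and referenced by name.
SAFETENSORS_SUFFIX = ".safetensors"

def build_exclusions(runai_streamer: bool):
    # Suffixes to exclude from the download, or None when nothing
    # should be filtered out.
    return [SAFETENSORS_SUFFIX] if runai_streamer else None

print(build_exclusions(True))   # ['.safetensors']
print(build_exclusions(False))  # None
```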

@ray-gardener ray-gardener bot added the serve (Ray Serve Related Issue), llm, and community-contribution (Contributed by the community) labels Aug 15, 2025
@kouroshHakha kouroshHakha changed the title Skip safetensor file downloads for runai streamer mode [data.llm] Skip safetensor file downloads for runai streamer mode Aug 18, 2025
Contributor

@kouroshHakha kouroshHakha left a comment


Hi @jiangwu300 ,

Thanks for your contribution. Left some feedback.

destination_path: Path where the model will be stored
bucket_uri: URI of the cloud directory containing the model
tokenizer_only: If True, only download tokenizer-related files
runai_streamer: If True, skip download of safetensor files
Contributor


Shouldn't this use case already be covered by using tokenizer_only=True?

Contributor Author


I noticed that models like DeepSeek R1 have files that aren't covered by the tokenizer-only path, which means we still need to download everything else except the safetensors:
[screenshot: DeepSeek R1 model directory listing]

if self.max_pending_requests > 0:
logger.info("Max pending requests is set to %d", self.max_pending_requests)

runai_streamer = self.engine_kwargs.get("load_format") == "runai_streamer"
Contributor


We should simply do download_model=....TokenizerOnly when load_format is one of the formats that need model downloading to be skipped. There are also other load_formats that should be treated the same way, like Tensorizer. Is there a better condition to use here?

Contributor Author


I don't think using tokenizer only would suffice (please correct me if I'm wrong) because certain models require specific files to exist in the model directory like Deepseek R1 (attached screenshot in comment above).

Contributor

@kouroshHakha kouroshHakha Aug 19, 2025


Gotcha, OK. So would it make sense to define a new download_model enum called EXCLUDE_SAFETENSOR that will basically exclude *.safetensors? We should make the API more generic than runai_streamer=True / False.
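
A hypothetical sketch of what such an enum-based API could look like (all names here are illustrative; the actual enum in Ray may differ):

```python
import enum

class DownloadModel(enum.Enum):
    FULL = "full"                               # download every file
    TOKENIZER_ONLY = "tokenizer_only"           # tokenizer + config files only
    EXCLUDE_SAFETENSOR = "exclude_safetensor"   # everything except *.safetensors

def allowed(filename: str, mode: DownloadModel) -> bool:
    # Decide per-file whether it should be downloaded under the given mode.
    if mode is DownloadModel.EXCLUDE_SAFETENSOR:
        return not filename.endswith(".safetensors")
    if mode is DownloadModel.TOKENIZER_ONLY:
        return filename == "config.json" or filename.startswith("tokenizer")
    return True

print(allowed("model-00001.safetensors", DownloadModel.EXCLUDE_SAFETENSOR))  # False
print(allowed("config.json", DownloadModel.EXCLUDE_SAFETENSOR))              # True
```

An enum keeps the download policy extensible: a future streaming loader can reuse EXCLUDE_SAFETENSOR instead of threading another boolean through the call chain.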

Contributor Author


That makes sense. I made this change.

Contributor

@nrghosh nrghosh left a comment


Thanks for the contribution!

Quick fixes for CI

  1. Please sign off all commits (e.g., git commit -s -m "foobar") (will fix the DCO check failure)
  2. Please lint for code format (see ./ci/lint/lint.sh code_format and ./ci/lint/lint.sh pre_commit) (will fix the microcheck failure)

@jiangwu300
Contributor Author

@nrghosh @kouroshHakha Is either of you able to escalate the vLLM issue (vllm-project/vllm#22843 (comment)) to the vLLM team? This fix still requires a change on the vLLM side.

@jiangwu300
Contributor Author

runai_log.txt
Tested and works as expected ^

jiangwu300 and others added 3 commits August 20, 2025 13:30
Signed-off-by: Jiang Wu <jwu@cclgroup.com>
Signed-off-by: Jiang Wu <JWu@cclgroup.com>
Signed-off-by: Jiang Wu <jwu@cclgroup.com>
Signed-off-by: Jiang Wu <jwu@cclgroup.com>
@kouroshHakha kouroshHakha added the go add ONLY when ready to merge, run all tests label Aug 21, 2025
Contributor

@kouroshHakha kouroshHakha left a comment


Running tests and release tests. Just one quick comment to address in parallel:

if self.max_pending_requests > 0:
logger.info("Max pending requests is set to %d", self.max_pending_requests)

exclude_safetensors = self.engine_kwargs.get("load_format") == "runai_streamer"
Contributor


This should be something like .get("load_format", "auto") not in ["safetensors", "auto"]?

Contributor Author


I'm actually not sure which load formats should exclude safetensor downloads except that runai_streamer needs to skip it for sure.

Contributor


I think we can revise the list. Let's make it inclusive of runai_streamer and tensorizer (which is similar to runai_streamer). I know about these two for sure.

Contributor Author


Added tensorizer
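
A sketch of the resulting condition, per the review discussion (the plain-dict `engine_kwargs` and the helper name are illustrative, not the merged code):

```python
# Both runai_streamer and tensorizer stream the weights themselves,
# so neither needs the safetensor files staged on local disk.
STREAMING_LOAD_FORMATS = ("runai_streamer", "tensorizer")

def exclude_safetensors(engine_kwargs: dict) -> bool:
    return engine_kwargs.get("load_format") in STREAMING_LOAD_FORMATS

print(exclude_safetensors({"load_format": "runai_streamer"}))  # True
print(exclude_safetensors({"load_format": "tensorizer"}))      # True
print(exclude_safetensors({"load_format": "auto"}))            # False
```

Keeping the list inclusive (rather than excluding known-bad formats) avoids accidentally skipping downloads for formats that do need the files on disk.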

Contributor

@kouroshHakha kouroshHakha left a comment


LGTM. Thanks

@kouroshHakha kouroshHakha merged commit 1818e21 into ray-project:master Aug 24, 2025
5 checks passed
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
…y-project#55662)

Signed-off-by: Jiang Wu <jwu@cclgroup.com>
Signed-off-by: Jiang Wu <JWu@cclgroup.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
…5662)

Signed-off-by: Jiang Wu <jwu@cclgroup.com>
Signed-off-by: Jiang Wu <JWu@cclgroup.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…y-project#55662)

Signed-off-by: Jiang Wu <jwu@cclgroup.com>
Signed-off-by: Jiang Wu <JWu@cclgroup.com>

Labels

community-contribution (Contributed by the community), go (add ONLY when ready to merge, run all tests), llm, serve (Ray Serve Related Issue)


Development

Successfully merging this pull request may close these issues.

[Data/LLM] vLLM model files are downloaded to disk even when "load_format": "runai_streamer" is specified in engine_kwargs

3 participants