Checklist
Motivation
We're using a distributed file system to store LLM weights in a Kubernetes environment. As is typical for such systems, it is tuned for maximum parallelism and performs relatively poorly with single-threaded, sequential reads. Through benchmarking, we found that model loading can be up to 5 times faster with 8 reader threads, compared to SGLang's current single-threaded loading.
We hope an option can be added to enable parallelism while reading the model weights. It would not matter much for users who store their weights on a local physical drive, but it could be life-saving for users with a distributed storage backend, including S3 (via S3FS).
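To illustrate the idea, here is a minimal sketch of chunked parallel reading of a single weights file. This is purely illustrative: `parallel_read` and its parameters are hypothetical and not part of any SGLang API; it simply splits a file into N ranges and reads them concurrently with positional reads, which is the access pattern distributed file systems tend to reward.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def parallel_read(path: str, num_threads: int = 8) -> bytes:
    """Read a file in num_threads chunks concurrently (hypothetical sketch)."""
    size = os.path.getsize(path)
    chunk = (size + num_threads - 1) // num_threads
    fd = os.open(path, os.O_RDONLY)
    try:
        def read_chunk(i: int) -> bytes:
            offset = i * chunk
            length = min(chunk, size - offset)
            if length <= 0:
                return b""
            # os.pread is safe to call from multiple threads on one fd:
            # it takes an explicit offset and never moves the shared file position.
            parts = []
            while length > 0:
                buf = os.pread(fd, length, offset)
                if not buf:  # unexpected EOF
                    break
                parts.append(buf)
                offset += len(buf)
                length -= len(buf)
            return b"".join(parts)

        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            return b"".join(pool.map(read_chunk, range(num_threads)))
    finally:
        os.close(fd)
```

On a local disk the threads mostly contend for one spindle or NVMe queue, so the gain is small; on a network-backed file system each range read can be served in parallel, which is where the reported ~5x speedup comes from.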
Related resources
vLLM uses Run:ai Model Streamer for streaming models concurrently to GPUs: https://docs.vllm.ai/en/stable/models/extensions/runai_model_streamer.html
Triton also supports loading models in parallel: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_management.html#concurrently-loading-models