feat(remote_model): support variable remote backend for model loader #3964
merrymercy merged 3 commits into sgl-project:main
Conversation
merrymercy
left a comment
Looks good to me. Can you rebase and resolve the conflicts?
Force-pushed from 97e1ed6 to e986407
Hello @merrymercy, I have resolved the conflicts, and also added 2 more commits: one for
Can you fix the lint error? We still want to support Python 3.9, so this syntax is not allowed.
OK, I'm working on it.
Force-pushed from ac5097a to 0733265
Signed-off-by: wangyu <wangyu.steph@bytedance.com>
@DellCurry Excuse me, how can I use this feature? For example, I want to directly pull the model from S3.
Similar as above. Of course, the user can also modify the
…gl-project#3964) Signed-off-by: wangyu <wangyu.steph@bytedance.com>
The new session cache has the following framework:

      /<---swap in----\                    +------+
     /                 \                   |entry0|
+--------+         +--------+              +------+
|sessionA|         |sessionX|---+  ...        |
+--------+         +--------+              +------+   +------------+
    |                  |                   |entryX|---|token offset|
+--------+         +--------+              +------+   |token length|
|sessionB|         |sessionY|                         |connector   |
+--------+         +--------+                         |uri         |
    |                  |                              +------------+
+--------+         +--------+
|sessionC|         |sessionZ|
+--------+         +--------+
     \                 /
      \---swap out--->/
1> Use an LRU based list to manage sessions. Moving from GPU HBM to
   extended memory is called 'swap out'; the reverse direction is
   called 'swap in'. This way a single sglang instance is able to
   handle multiple sessions.
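The LRU swap policy described above can be sketched roughly as follows. This is a minimal illustration only, assuming a plain `OrderedDict` and hypothetical names (`SessionLRU`, `touch`); it is not the actual implementation:

```python
from collections import OrderedDict

class SessionLRU:
    """Keep at most `capacity` sessions resident in GPU HBM;
    evict the least recently used session to extended memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # session_id -> session state

    def touch(self, session_id, state):
        # Mark a session as most recently used (swapping it in if needed).
        if session_id in self.resident:
            self.resident.move_to_end(session_id)
        else:
            self.resident[session_id] = state  # 'swap in'
            if len(self.resident) > self.capacity:
                # 'swap out': evict the least recently used session;
                # a real implementation would persist its state here.
                self.resident.popitem(last=False)

lru = SessionLRU(capacity=2)
lru.touch("sessionA", {})
lru.touch("sessionB", {})
lru.touch("sessionA", {})  # refresh A
lru.touch("sessionC", {})  # evicts B, the least recently used
print(list(lru.resident))  # ['sessionA', 'sessionC']
```

The `OrderedDict` gives O(1) `move_to_end` and `popitem(last=False)`, which is why it is a common basis for LRU lists in Python.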
2> A session has one or more 'session cache entry(s)', collectively
   called the 'session cache meta'. The session meta could be saved
   into several types of databases:
   a> memkv: a hash map in Python. It has the same life cycle as the
      sglang program: once sglang restarts, the previous meta data is
      lost. However, it does *NOT* have any dependency on other
      modules.
   b> sqlite (in development): it persists the meta on the local
      machine, so sglang can reload the meta after the program
      restarts. With a local file based connector in use, it is
      possible to reload all the session meta and data after restarting.
   c> mysql, pgsql, etcd, redis (in development): distributed database
      based session meta allows multiple sglang instances to share it
      across a cluster of machines. A request (or a new session) could
      be handled on any sglang instance, and the kv-cache can also be
      saved into / reloaded from a distributed storage system[1]. Then
      we benefit:
      * a higher cache hit rate: because both the session meta and
        kv-cache are saved into a distributed database/storage, the
        session cache is expected to hit at a high rate.
      * easy to deploy/manage: the sglang instance can be managed by
        k8s; because sglang has already offloaded its state (meta and
        kv-cache) to the distributed system, it becomes *stateless*.
      * stateless services have shown higher CPU utilization for
        'traditional' web services. We expect this to reproduce on
        the new distributed system.
3> A single session cache entry has token offset, token length,
   connector and uri fields.
   The 'connector' may be:
   a> a local file directory. It will work fine with sqlite based meta.
   b> valkey, 3fs (and so on): bingo! A *stateless* sglang service is
      coming!
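A session cache entry holding the fields above, plus a memkv-style meta store, might look like this. This is a hedged sketch: the class name, field types, and the plain-dict store are assumptions for illustration, not the actual code:

```python
from dataclasses import dataclass

@dataclass
class SessionCacheEntry:
    token_offset: int  # where this entry's tokens start in the session
    token_length: int  # how many tokens this entry covers
    connector: str     # e.g. "file", "valkey", "3fs"
    uri: str           # where the kv-cache data lives

# memkv-style meta store: a plain hash map with the same
# life cycle as the program (lost on restart).
memkv: dict[str, list[SessionCacheEntry]] = {}
memkv["sessionA"] = [
    SessionCacheEntry(token_offset=0, token_length=512,
                      connector="file", uri="/tmp/sessionA/entry0"),
]
print(memkv["sessionA"][0].token_length)  # 512
```

A sqlite or redis backed meta store would keep the same entry shape but serialize it into a table or key-value pair instead of the in-process dict.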
Link[1]: commit 1ce4878("feat(remote_model): support variable remote backend for model loader (sgl-project#3964)")
Signed-off-by: Changqi Lu <luchangqi.123@bytedance.com>
Co-authored-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
# Shutdown the subprocesses automatically when the program exits
atexit.register(self.shutdown)

# Allocate ports for inter-process communications
I don't think this is required since we do the same in `_launch_subprocess`. Also, this causes the `server_args` to never be logged: here the logger hasn't been configured, and down in `_launch_subprocess` the if condition will never be true if called from `Engine.__init__`, as we pass the port args.
Motivation
Similar to what I did in vLLM: support variable remote backend.
Modifications
Background
Currently, one of the most common ways to load a model is loading from local disk, which means users must first download the model files from HF or cloud storage to local storage. Obviously this wastes a lot of time, especially for huge models.
Of course there are ways to load directly from remote, such as a remote filesystem like NFS. Those methods also have their own drawbacks in network speed and flexibility.
Besides, some organizations hope to use a KV database such as Redis to accelerate model loading. Our team has implemented an RDMA-based KV database which is much faster, as shown below:

What this PR does
In order to provide more flexibility, I add a new `ModelLoader` class named `RemoteModelLoader`, and introduce a new module named `Connector`. `RemoteModelLoader` creates a `Connector` as its member: it loads the model first and then fetches the weight tensors one by one from the `Connector`. `Connector` has two types: `KV` for KV databases and `FS` for remote file storage. Both types must implement `weight_iterator()` to yield weight tensors and `pull_files()` to download the model config files. I have implemented `RedisConnector` as an example of a `KV`-Connector (most of the serde part is copied from LMCache). A `KV`-Connector could also be used for a remote prefix cache in the future, as `LMCache` does.

TBD
If this PR proves to be helpful, I will fix the following soon:
- `S3Connector` for `S3`-compatible remote backends, as an example of an `FS`-Connector
- `ShardedStateLoader` is also missing this script (these two scripts are very similar, maybe one commit for both)

Checklist