
feat(remote_model): support variable remote backend for model loader#3964

Merged
merrymercy merged 3 commits into sgl-project:main from DellCurry:remote_model
Mar 14, 2025

Conversation

@DellCurry
Contributor

@DellCurry DellCurry commented Feb 28, 2025

Motivation

Similar to what I did in vLLM: support variable remote backend

Modifications

Background

Currently, the most common way to load a model is from local disk, which means the user must first download the model files from HF or cloud storage to local storage. This obviously wastes a lot of time, especially for huge models.

Of course there are ways to load directly from remote storage, such as a remote filesystem like NFS, but those methods have their own drawbacks in network speed and flexibility.

Besides, some organizations want to use a KV database such as Redis to accelerate model loading. Our team has implemented an RDMA-based KV database which is much faster, as shown below:
[image: benchmark results]

What this PR does

In order to provide more flexibility, I add a new ModelLoader class named RemoteModelLoader and introduce a new module named Connector. RemoteModelLoader creates a Connector as its member, loads the model first, and then fetches the weight tensors one by one from the Connector.

Connector has two types: KV for KV databases and FS for remote file storage. Both types must implement weight_iterator() to yield weight tensors and pull_files() to download the model config files. I have implemented RedisConnector as an example of a KV-Connector (most of the serde part is copied from LMCache).

KV-Connector could also be used for a remote prefix cache in the future, as LMCache does.
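The two required methods can be pictured with a minimal sketch. The method names below (weight_iterator(), pull_files()) follow the PR's description, while DictConnector and its in-memory store are purely hypothetical stand-ins for a real backend such as Redis:

```python
from abc import ABC, abstractmethod
from typing import Any, Iterator, Tuple


class BaseConnector(ABC):
    """Sketch of the Connector interface described above (names assumed)."""

    def __init__(self, url: str):
        self.url = url

    @abstractmethod
    def weight_iterator(self) -> Iterator[Tuple[str, Any]]:
        """Yield (weight_name, tensor_payload) pairs, one per model weight."""

    @abstractmethod
    def pull_files(self, local_dir: str) -> None:
        """Download model config files (config.json, tokenizer, ...) to local_dir."""


class DictConnector(BaseConnector):
    """Toy in-memory stand-in for a KV backend such as Redis."""

    def __init__(self, url: str, store: dict):
        super().__init__(url)
        self.store = store

    def weight_iterator(self):
        # Each key is a weight name, each value a serialized tensor blob.
        yield from self.store.items()

    def pull_files(self, local_dir: str) -> None:
        # A real KV connector would fetch config files here; nothing to do.
        pass
```

A RemoteModelLoader would then iterate over weight_iterator() and copy each payload into the corresponding model parameter.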

TBD

If this PR proves to be helpful, I will add the following soon:

  • an S3Connector for S3-compatible remote backends, as an example of an FS-Connector
  • a script to save model weight tensors to a remote KV database (note that ShardedStateLoader is also missing this script; the two scripts are very similar, so maybe one commit can cover both)
  • possible unit tests and coding-style fixes
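The planned save script could look roughly like the sketch below. save_state_dict, kv_set, and the key prefix are all assumptions for illustration, not code from this PR:

```python
def save_state_dict(state_dict, kv_set, prefix="model/"):
    """Push each serialized weight under <prefix><name> via kv_set(key, value).

    `kv_set` stands in for a KV client call such as redis.Redis().set.
    """
    for name, tensor in state_dict.items():
        # A real script would serialize with safetensors or torch.save; here
        # we assume `tensor` already exposes a raw-bytes view for simplicity.
        kv_set(prefix + name, bytes(tensor))
```

The mirror-image load path is what weight_iterator() provides: scan keys under the prefix and yield each (name, blob) pair back to the loader.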


Contributor

@merrymercy merrymercy left a comment


Looks good to me. Can you rebase and resolve the conflicts?

@DellCurry DellCurry force-pushed the remote_model branch 4 times, most recently from 97e1ed6 to e986407 Compare March 6, 2025 15:28
@DellCurry
Contributor Author

DellCurry commented Mar 6, 2025

Looks good to me. Can you rebase and resolve the conflicts?

Hello @merrymercy, I have resolved the conflicts and also added 2 more commits: one for an S3 connector and 2 scripts for save_model() via RemoteModelLoader and ShardedStateLoader (as well as an RPC framework for Engine, which is used by the scripts). May I re-request a review?

@DellCurry DellCurry requested a review from merrymercy March 10, 2025 06:04
@merrymercy
Contributor

Can you fix the lint error? We still want to support python 3.9 so this syntax is not allowed.
https://github.com/sgl-project/sglang/actions/runs/13827770465/job/38685856096?pr=3964#step:5:48

@DellCurry
Contributor Author

Can you fix the lint error? We still want to support python 3.9 so this syntax is not allowed. https://github.com/sgl-project/sglang/actions/runs/13827770465/job/38685856096?pr=3964#step:5:48

ok, I'm working on it.

@DellCurry DellCurry force-pushed the remote_model branch 2 times, most recently from ac5097a to 0733265 Compare March 13, 2025 06:23
Signed-off-by: wangyu <wangyu.steph@bytedance.com>
@merrymercy merrymercy merged commit 1ce4878 into sgl-project:main Mar 14, 2025
@leoliulei

@DellCurry Excuse me, how can I use this feat? For example, I want to directly pull the model from S3.

@DellCurry
Contributor Author

DellCurry commented Mar 14, 2025

@DellCurry Excuse me, how can I use this feat? For example, I want to directly pull the model from S3.

Similar to vLLM. Here is an example for MinIO:

RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 AWS_ENDPOINT_URL=http://127.0.0.1:9000 AWS_EC2_METADATA_DISABLED=true AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin python3 -m sglang.launch_server --model-path s3://models/Meta-Llama-3-8B/ --port 8000 --tp-size 1

Of course, users can also modify s3.py or even implement their own connector if they do not want to use runai-streamer or prefer another method.

hebiao064 pushed a commit to hebiao064/sglang that referenced this pull request Mar 14, 2025
zeroorhero pushed a commit to zeroorhero/sglang that referenced this pull request Mar 17, 2025
The new session cache framework is as follows:

          /<---swap in----\         +------+
         /                 \        |entry0|
    +--------+         +--------+   +------+
    |sessionA|         |sessionX|---+ ...  |   +------------+
    +--------+         +--------+   +------+   |token offset|
        |                  |        |entryX|---+token length|
    +--------+         +--------+   +------+   |connector   |
    |sessionB|         |sessionY|              |uri         |
    +--------+         +--------+              +------------+
        |                  |
    +--------+         +--------+
    |sessionC|         |sessionZ|
    +--------+         +--------+
         \                 /
          \---swap out--->/

1> Use an LRU-based list to manage sessions. Moving data from GPU HBM
   to extended memory is called 'swap out'; the reverse direction is
   called 'swap in'. A single sglang instance is then able to handle
   multiple sessions.
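The swap-in/swap-out policy in 1> can be sketched with an OrderedDict-based LRU; SessionLRU and all names below are assumptions for illustration only:

```python
from collections import OrderedDict


class SessionLRU:
    """Toy LRU session list: the least-recently-used session is swapped
    out of (simulated) HBM when capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.resident = OrderedDict()  # sessions currently "in HBM"
        self.swapped_out = {}          # sessions moved to extended memory

    def touch(self, session_id, state=None):
        if session_id in self.resident:
            self.resident.move_to_end(session_id)  # mark as recently used
            return
        # Swap in: restore from extended memory if present.
        state = self.swapped_out.pop(session_id, state)
        self.resident[session_id] = state
        if len(self.resident) > self.capacity:
            # Swap out the least recently used session.
            victim, victim_state = self.resident.popitem(last=False)
            self.swapped_out[victim] = victim_state
```

A real implementation would move kv-cache tensors between HBM and the extended memory tier here, but the bookkeeping is the same.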

2> A session has one or more 'session cache entries', collectively
   called the 'session cache meta'. The session meta can be saved
   into several types of databases:
   a> memkv: a hash map in Python. It has the same life cycle as the
      sglang program; once sglang restarts, the previous metadata is
      lost. However, it does *NOT* depend on any other module.
   b> sqlite (in development): persists meta on the machine, so
      sglang can load the meta after the program restarts. With the
      local-file-based connector in use, it is possible to reload
      all session meta and data after restarting.
   c> mysql, pgsql, etcd, redis (in development): storing session meta
      in a distributed database allows multiple sglang instances to
      share it across a cluster of machines. A request (or a new
      session) can be handled on any sglang instance, and the kv-cache
      can also be saved into / reloaded from a distributed storage
      system[1]. The benefits:
      * a higher cache hit rate: because both the session meta and the
        kv-cache are saved in a distributed database/storage, the
        session cache is expected to hit at a high rate.
      * easy to deploy/manage: the sglang instance can be managed by
        k8s, because sglang has already offloaded its state (meta and
        kv-cache) to the distributed system, making it *stateless*.
      * stateless services have shown higher CPU utilization than
        'traditional' stateful web services; we expect this to
        reproduce on the new distributed system.
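The simplest backend from 2a>, memkv, is essentially an in-process hash map; a minimal sketch (class and method names are assumptions):

```python
class MemKVMetaStore:
    """Sketch of the 'memkv' backend: a plain in-process hash map, so
    session meta lives only as long as the sglang process itself."""

    def __init__(self):
        self._meta = {}

    def put(self, session_id: str, meta: dict) -> None:
        self._meta[session_id] = meta

    def get(self, session_id: str):
        # Returns None for unknown sessions, e.g. after a process restart.
        return self._meta.get(session_id)
```

The sqlite and distributed-database backends would expose the same put/get surface but persist the mapping outside the process.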

3> A single session cache entry has token offset, token length,
   connector and uri fields.
   The 'connector' may be:
   a> a local file directory. It will work fine with sqlite-based meta.
   b> valkey, 3fs (and so on): bingo! A *stateless* sglang service is
      coming!
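The four fields listed in 3> map naturally onto a small dataclass; SessionCacheEntry and the field comments below are illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass
class SessionCacheEntry:
    """Sketch of one session cache entry with the four fields above."""
    token_offset: int  # where this entry's tokens start within the session
    token_length: int  # how many tokens the entry covers
    connector: str     # backend type, e.g. a file directory, valkey, 3fs
    uri: str           # location of the cached kv data in that backend
```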

Link[1]: commit 1ce4878("feat(remote_model): support variable remote backend for model loader (sgl-project#3964)")
Signed-off-by: Changqi Lu <luchangqi.123@bytedance.com>
Co-authored-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
# Shutdown the subprocesses automatically when the program exits
atexit.register(self.shutdown)

# Allocate ports for inter-process communications
Contributor

@DevashishLal-CB DevashishLal-CB May 8, 2025


I don't think this is required, since we do the same in _launch_subprocess. It also causes server_args to never be logged, because here the logger hasn't been configured yet, and down in _launch_subprocess the if condition will never be true when called from Engine.__init__, since we pass the port args.


