feat(remote_model): support variable remote backend for model loader #3964
merrymercy merged 3 commits into sgl-project:main
Conversation
merrymercy
left a comment
Looks good to me. Can you rebase and resolve the conflicts?
Force-pushed from 97e1ed6 to e986407
Hello @merrymercy, I have resolved the conflicts, and also added 2 more commits: one for
Can you fix the lint error? We still want to support Python 3.9, so this syntax is not allowed.
OK, I'm working on it.
Force-pushed from ac5097a to 0733265
Signed-off-by: wangyu <wangyu.steph@bytedance.com>
@DellCurry Excuse me, how can I use this feature? For example, I want to directly pull the model from S3.
Similar as above. Of course, the user can also modify the
…gl-project#3964) Signed-off-by: wangyu <wangyu.steph@bytedance.com>
The new session cache has the following framework:

      /<---swap in----\                    +------+
     /                 \                   |entry0|
+--------+         +--------+              +------+
|sessionA|         |sessionX|---+  ...        |
+--------+         +--------+              +------+   +------------+
    |                  |                   |entryX|---|token offset|
+--------+         +--------+              +------+   |token length|
|sessionB|         |sessionY|                         |connector   |
+--------+         +--------+                         |uri         |
    |                  |                              +------------+
+--------+         +--------+
|sessionC|         |sessionZ|
+--------+         +--------+
     \                 /
      \---swap out--->/
1> Use an LRU based list to manage sessions. Moving from GPU HBM to
   extended memory is called 'swap out'; the reverse direction is
   called 'swap in'. This way a single sglang instance is able to
   handle multiple sessions.
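The LRU swap policy described above can be sketched roughly as follows. This is a minimal illustration only, assuming a plain `OrderedDict` and hypothetical names (`SessionLRU`, `touch`); it is not the actual implementation:

```python
from collections import OrderedDict

class SessionLRU:
    """Keep at most `capacity` sessions resident in GPU HBM;
    evict the least recently used session to extended memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # session_id -> session state

    def touch(self, session_id, state):
        # Mark a session as most recently used (swapping it in if needed).
        if session_id in self.resident:
            self.resident.move_to_end(session_id)
        else:
            self.resident[session_id] = state  # 'swap in'
            if len(self.resident) > self.capacity:
                # 'swap out': evict the least recently used session;
                # a real implementation would persist its state here.
                self.resident.popitem(last=False)

lru = SessionLRU(capacity=2)
lru.touch("sessionA", {})
lru.touch("sessionB", {})
lru.touch("sessionA", {})  # refresh A
lru.touch("sessionC", {})  # evicts B, the least recently used
print(list(lru.resident))  # ['sessionA', 'sessionC']
```

The `OrderedDict` gives O(1) `move_to_end` and `popitem(last=False)`, which is why it is a common basis for LRU lists in Python.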
2> A session has one or more 'session cache entry(s)', collectively
   called the 'session cache meta'. The session meta could be saved
   into several types of databases:
   a> memkv: a hash map in Python. It has the same life cycle as the
      sglang program: once sglang restarts, the previous meta data is
      lost. However, it does *NOT* have any dependency on other
      modules.
   b> sqlite (in development): it persists the meta on the local
      machine, so sglang can reload the meta after the program
      restarts. With a local file based connector in use, it is
      possible to reload all the session meta and data after restarting.
   c> mysql, pgsql, etcd, redis (in development): distributed database
      based session meta allows multiple sglang instances to share it
      across a cluster of machines. A request (or a new session) could
      be handled on any sglang instance, and the kv-cache can also be
      saved into / reloaded from a distributed storage system[1]. Then
      we benefit:
      * a higher cache hit rate: because both the session meta and
        kv-cache are saved into a distributed database/storage, the
        session cache is expected to hit at a high rate.
      * easy to deploy/manage: the sglang instance can be managed by
        k8s; because sglang has already offloaded its state (meta and
        kv-cache) to the distributed system, it becomes *stateless*.
      * stateless services have shown higher CPU utilization for
        'traditional' web services. We expect this to reproduce on
        the new distributed system.
3> A single session cache entry has token offset, token length,
   connector and uri fields.
   The 'connector' may be:
   a> a local file directory. It will work fine with sqlite based meta.
   b> valkey, 3fs (and so on): bingo! A *stateless* sglang service is
      coming!
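A session cache entry holding the fields above, plus a memkv-style meta store, might look like this. This is a hedged sketch: the class name, field types, and the plain-dict store are assumptions for illustration, not the actual code:

```python
from dataclasses import dataclass

@dataclass
class SessionCacheEntry:
    token_offset: int  # where this entry's tokens start in the session
    token_length: int  # how many tokens this entry covers
    connector: str     # e.g. "file", "valkey", "3fs"
    uri: str           # where the kv-cache data lives

# memkv-style meta store: a plain hash map with the same
# life cycle as the program (lost on restart).
memkv: dict[str, list[SessionCacheEntry]] = {}
memkv["sessionA"] = [
    SessionCacheEntry(token_offset=0, token_length=512,
                      connector="file", uri="/tmp/sessionA/entry0"),
]
print(memkv["sessionA"][0].token_length)  # 512
```

A sqlite or redis backed meta store would keep the same entry shape but serialize it into a table or key-value pair instead of the in-process dict.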
Link[1]: commit 1ce4878("feat(remote_model): support variable remote backend for model loader (sgl-project#3964)")
Signed-off-by: Changqi Lu <luchangqi.123@bytedance.com>
Co-authored-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
# Shutdown the subprocesses automatically when the program exits
atexit.register(self.shutdown)

# Allocate ports for inter-process communications
I don't think this is required since we do the same in `_launch_subprocess`. Also, this causes the `server_args` to never be logged: here the logger hasn't been configured, and down in `_launch_subprocess` the if condition will never be true if called from `Engine.__init__`, as we pass the port args.
Motivation
Similar to what I did in vLLM: support variable remote backend.
Modifications
Background
Currently, one of the most common ways to load a model is loading from local disk, which means users must first download the model files from HF or cloud storage to local storage. Obviously this wastes a lot of time, especially for huge models.
Of course there are ways to load directly from remote, such as a remote filesystem like NFS. Those methods also have their own drawbacks in network speed and flexibility.
Besides, some organizations hope to use a KV database such as Redis to accelerate model loading. Our team has implemented an RDMA-based KV database which is much faster, as shown below:

What this PR does
In order to provide more flexibility, I add a new `ModelLoader` class named `RemoteModelLoader`, and introduce a new module named `Connector`. `RemoteModelLoader` creates a `Connector` as its member: it loads the model first and then fetches the weight tensors one by one from the `Connector`. `Connector` has two types: `KV` for KV databases and `FS` for remote file storage. Both types must implement `weight_iterator()` to yield weight tensors and `pull_files()` to download the model config files. I have implemented `RedisConnector` as an example of a `KV`-Connector (most of the serde part is copied from LMCache). A `KV`-Connector could also be used for a remote prefix cache in the future, as `LMCache` does.

TBD
If this PR proves to be helpful, I will fix the following soon:
- `S3Connector` for `S3`-compatible remote backends, as an example of an `FS`-Connector
- `ShardedStateLoader` is also missing this script (these two scripts are very similar, maybe one commit for both)

Checklist