Checklist
Describe the bug
As described in the title: after releasing and resuming memory occupation, generation produces garbled output (see below). Are users required to do an explicit weight update before generating after wake-up, or is the following expected behavior? Thank you!
Reproduction
```python
import torch
import sglang as sgl


def main_sglang():
    server_args = {
        "model_path": "Qwen/Qwen3-0.6B",
        "disable_cuda_graph": True,  # does not affect behavior
        "tp_size": 1,
        "enable_memory_saver": True,
    }
    llm = sgl.Engine(**server_args)

    print(f"Free GPU memory before sleep: {torch.cuda.mem_get_info()[0] / 1024**2:.1f} MB")
    llm.release_memory_occupation()
    print(f"Free GPU memory after sleep: {torch.cuda.mem_get_info()[0] / 1024**2:.1f} MB")
    llm.resume_memory_occupation()
    print(f"Free GPU memory after wake up: {torch.cuda.mem_get_info()[0] / 1024**2:.1f} MB")

    outputs = llm.generate("Hello, my name is", {"max_new_tokens": 10})
    print(outputs["text"])


if __name__ == "__main__":
    main_sglang()
```
Output:
```
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.48it/s]
Free GPU memory before sleep: 8338.2 MB
Free GPU memory after sleep: 38830.2 MB
Free GPU memory after wake up: 8318.2 MB
هذهn屋쨌养老保险 admins Aerospace.lex العسكري والتي
```
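My guess at what is happening (an assumption on my part, not confirmed from the SGLang internals): `resume_memory_occupation()` re-reserves the GPU memory, but the restored storage does not contain the trained weight values, so `generate()` runs on effectively uninitialized weights until they are explicitly updated. A minimal plain-Python sketch of that suspected failure mode:

```python
# Simulate the suspected sleep/wake failure mode: memory is re-reserved on
# wake-up, but the weight *contents* are not restored from the checkpoint.
trained = [1.5, -0.3, 0.7]       # weight values as loaded from the checkpoint
weights = list(trained)

weights = None                   # "release_memory_occupation": free the weights

weights = [0.0, 0.0, 0.0]        # "resume_memory_occupation": storage is back,
                                 # but the contents are not the trained values
restored_on_wake = weights == trained
print("restored on wake:", restored_on_wake)          # False -> garbled output

weights = list(trained)          # the "explicit weight update" before generating
print("after explicit update:", weights == trained)   # True
```

If this is right, the garbled generation above would be expected until an explicit weight update is performed after wake-up.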
Environment
- A100 40GB
- Tried SGLang built from source at head (766392c), and pip-installed 0.4.9.post1 and 0.4.8.post1