
fix: use fp16 dtype for sm75 #1136

Merged
zhyncs merged 1 commit into sgl-project:main from zhyncs:sm75
Aug 17, 2024

Conversation

@zhyncs
Collaborator

@zhyncs zhyncs commented Aug 17, 2024

Motivation

Modification

Checklist

  • Before submitting a PR for review, make sure it has passed verification in your local development environment.
  • Run pre-commit run --all-files or other linting tools to fix potential lint issues.
  • Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
  • Modify documentation as needed, such as docstrings or example tutorials.

@zhyncs zhyncs marked this pull request as draft August 17, 2024 13:07
@zhyncs zhyncs added the wip label Aug 17, 2024
@zhyncs zhyncs removed the wip label Aug 17, 2024
@zhyncs zhyncs marked this pull request as ready for review August 17, 2024 14:27
@zhyncs
Collaborator Author

zhyncs commented Aug 17, 2024

Tested with a GCP T4:

(base) root@hostname:/home/me/sglang# python3 -m sglang.launch_server --model Qwen/Qwen1.5-1.8B-Chat --disable-flashinfer-sampling --mem-frac 0.7
server_args=ServerArgs(model_path='Qwen/Qwen1.5-1.8B-Chat', tokenizer_path='Qwen/Qwen1.5-1.8B-Chat', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', trust_remote_code=False, context_length=None, quantization=None, served_model_name='Qwen/Qwen1.5-1.8B-Chat', chat_template=None, host='127.0.0.1', port=30000, additional_ports=[30001, 30002, 30003, 30004], mem_fraction_static=0.7, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=593901843, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=True, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_disk_cache=False, enable_mixed_chunk=False, enable_torch_compile=False, enable_p2p_check=False, enable_mla=False, attention_reduce_in_fp32=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu=0] Init nccl begin.
[gpu=0] Load weight begin. avail mem=14.47 GB
Compute capability below sm80 use float16 due to lack of bfloat16 support.
INFO 08-17 14:35:09 weight_utils.py:225] Using model weights format ['*.safetensors']
INFO 08-17 14:35:09 weight_utils.py:269] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.79s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.79s/it]

[gpu=0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=10.93 GB
[gpu=0] Memory pool end. avail mem=4.02 GB
[gpu=0] Capture cuda graph begin. This can take up to several minutes.
[gpu=0] max_total_num_tokens=35991, max_prefill_tokens=16384, max_running_requests=2047, context_len=32768
INFO:     Started server process [226684]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
INFO:     127.0.0.1:57388 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu=0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
INFO:     127.0.0.1:57398 - "POST /generate HTTP/1.1" 200 OK
The server is fired up and ready to roll!
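The log line "Compute capability below sm80 use float16 due to lack of bfloat16 support." is the fallback this PR adds: bfloat16 tensor operations require compute capability 8.0 (Ampere) or newer, and the T4 is sm75 (Turing). A minimal sketch of that selection logic, using a hypothetical `choose_dtype` helper — the actual change lives in python/sglang/srt/model_executor/model_runner.py and this is not its real code:

```python
# Hypothetical sketch of the sm75 dtype fallback (invented helper name).

def choose_dtype(compute_capability: tuple, requested: str = "bfloat16") -> str:
    """Fall back to float16 on GPUs older than sm80 (e.g. T4 is sm75),
    since bfloat16 requires compute capability >= 8.0."""
    major, minor = compute_capability
    if requested == "bfloat16" and (major, minor) < (8, 0):
        # Turing (sm75) and earlier lack bfloat16 support.
        return "float16"
    return requested

print(choose_dtype((7, 5)))  # float16 on a T4 (sm75)
print(choose_dtype((8, 0)))  # bfloat16 is kept on an A100 (sm80)
```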

Comment thread on python/sglang/srt/model_executor/model_runner.py
@zhyncs zhyncs merged commit 9208591 into sgl-project:main Aug 17, 2024
@zhyncs zhyncs deleted the sm75 branch August 17, 2024 14:45
@zhyncs
Collaborator Author

zhyncs commented Aug 17, 2024

This check is not added in check_server_args because doing so would trigger a Cannot re-initialize CUDA in forked subprocess error.
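The constraint above follows from how CUDA interacts with process forking: argument validation runs in the parent process before workers are forked, and initializing CUDA there (for example, to query the device's compute capability) makes CUDA unusable in every forked child. A sketch of the resulting placement — all names below are illustrative, not the actual sglang code:

```python
# Illustrative sketch (invented names). Querying the GPU in the parent
# process initializes CUDA; forked workers then fail with
# "Cannot re-initialize CUDA in forked subprocess".

def check_server_args(args: dict) -> dict:
    # Parent process: CPU-side validation only. Touching the GPU here
    # (e.g. torch.cuda.get_device_capability()) would poison the fork.
    assert 0.0 < args["mem_fraction_static"] <= 1.0
    return args

def model_runner_init(args: dict, capability: tuple) -> dict:
    # Worker process: CUDA is first initialized here, so the sm75
    # fp16 fallback is safe to apply at this point.
    if args["dtype"] in ("auto", "bfloat16") and capability < (8, 0):
        args["dtype"] = "float16"
    return args

args = check_server_args({"mem_fraction_static": 0.7, "dtype": "auto"})
print(model_runner_init(args, (7, 5))["dtype"])  # float16
```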

@zhyncs zhyncs mentioned this pull request Aug 17, 2024
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025