👋 Hi, and thanks again for all the updates and improvements to this framework.
I tried running SGLang with an AWQ version of LLaVA and ran into the following error:
```
$ python3 -m sglang.launch_server --model-path Shopify/llava-awq-test --tokenizer-path llava-hf/llava-1.5-7b-hf --host 0.0.0.0 --port 30000 --tp-size 1
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Rank 0: load weight begin.
quant_config: AWQConfig(weight_bits=4, group_size=128, zero_point=True)
/opt/conda/envs/sglang_awq/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
INFO 02-26 21:33:41 weight_utils.py:163] Using model weights format ['*.bin']
INFO 02-26 21:33:44 weight_utils.py:163] Using model weights format ['*.bin']
torch.Size([4096, 1536]) torch.Size([1024, 4096])
Process Process-1:
router init state: Traceback (most recent call last):
  File "/home/gcpuser/sglang/python/sglang/srt/managers/router/manager.py", line 68, in start_router_process
    model_client = ModelRpcClient(server_args, port_args)
  File "/home/gcpuser/sglang/python/sglang/srt/managers/router/model_rpc.py", line 612, in __init__
    self.model_server.exposed_init_model(0, server_args, port_args)
  File "/home/gcpuser/sglang/python/sglang/srt/managers/router/model_rpc.py", line 62, in exposed_init_model
    self.model_runner = ModelRunner(
  File "/home/gcpuser/sglang/python/sglang/srt/managers/router/model_runner.py", line 275, in __init__
    self.load_model()
  File "/home/gcpuser/sglang/python/sglang/srt/managers/router/model_runner.py", line 308, in load_model
    model.load_weights(
  File "/home/gcpuser/sglang/python/sglang/srt/models/llava.py", line 292, in load_weights
    self.language_model.load_weights(
  File "/home/gcpuser/sglang/python/sglang/srt/models/llama2.py", line 311, in load_weights
    weight_loader(param, loaded_weight, shard_id)
  File "/opt/conda/envs/sglang_awq/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 436, in weight_loader
    assert param_data.shape == loaded_weight.shape
AssertionError
detoken init state: init ok
```
I know the error originates in vLLM, but I was wondering whether SGLang is expected to support an AWQ version of LLaVA.
Could it be something about how I quantized LLaVA, or is it an incompatibility with SGLang / vLLM?
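For what it's worth, here is a minimal sketch (the local path is a placeholder) for dumping the shapes of the packed AWQ tensors in the converted checkpoint shards, so they can be compared against the `param_data.shape` / `loaded_weight.shape` pair that trips the assertion in vLLM's `weight_loader`:

```python
# Sketch: print the shapes of the packed AWQ tensors in the converted HF
# checkpoint shards, to compare against the shapes in the failing assertion
# (param_data.shape vs. loaded_weight.shape in vllm's weight_loader).
# "/path/to/llava-awq-test" is a placeholder for the local model directory.
import glob
import torch

for shard in sorted(glob.glob("/path/to/llava-awq-test/*.bin")):
    state_dict = torch.load(shard, map_location="cpu")
    for name, tensor in state_dict.items():
        if any(k in name for k in ("qweight", "qzeros", "scales")):
            print(f"{name}: {tuple(tensor.shape)} {tensor.dtype}")
```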
Details on creating the AWQ version
To create the 4-bit AWQ version, I used the quantization method described in llm-awq, following its VILA example. I then converted the result to an HF-format model and tried to run it with SGLang.
```
$ python -m awq.entry --model_path /home/gcpuser/sky_workdir/llava-v1.5-7b --w_bit 4 --q_group_size 128 --run_awq --dump_awq /home/gcpuser/sky_workdir/awq_cache/llava-v1.5-7b-w4-g128.pt
Quantization config: {'zero_point': True, 'q_group_size': 128}
* Building model /home/gcpuser/sky_workdir/llava-v1.5-7b
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]/opt/conda/envs/quantize_llava/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 5.61it/s]
/opt/conda/envs/quantize_llava/lib/python3.10/site-packages/huggingface_hub/repocard.py:105: UserWarning: Repo card metadata block was not found. Setting CardData to empty.
warnings.warn("Repo card metadata block was not found. Setting CardData to empty.")
Token indices sequence length is longer than the specified maximum sequence length for this model (8322 > 2048). Running this sequence through the model will result in indexing errors
* Split into 65 blocks
Running AWQ...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [08:33<00:00, 16.04s/it]
AWQ results saved at /home/gcpuser/sky_workdir/awq_cache/llava-v1.5-7b-w4-g128.pt
```
```
$ python -m awq.entry \
--model_path /home/gcpuser/sky_workdir/llava-v1.5-7b \
--w_bit 4 \
--q_group_size 128 \
--load_awq /home/gcpuser/sky_workdir/awq_cache/llava-v1.5-7b-w4-g128.pt \
--q_backend real \
--dump_quant /home/gcpuser/sky_workdir/quant_cache/llava-v1.5-7b-w4-g128-awq.pt
Quantization config: {'zero_point': True, 'q_group_size': 128}
* Building model /home/gcpuser/sky_workdir/llava-v1.5-7b
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]/opt/conda/envs/quantize_llava/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 5.35it/s]
Loading pre-computed AWQ results from /home/gcpuser/sky_workdir/awq_cache/llava-v1.5-7b-w4-g128.pt
real weight quantization...: 100%|█████████████████████████████████████████████████████████████████████| 32/32 [03:58<00:00, 7.46s/it]
[Info] Auto-change the dump_quant file name to *v2.pt
Saving the quantized model at /home/gcpuser/sky_workdir/quant_cache/llava-v1.5-7b-w4-g128-awq-v2.pt...
```
Model config
```json
{
  "_name_or_path": "/home/gcpuser/sky_workdir/llava-v1.5-7b",
  "architectures": [
    "LlavaLlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "freeze_mm_mlp_adapter": false,
  "freeze_mm_vision_resampler": false,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "ignore_index": -100,
  "image_aspect_ratio": "pad",
  "image_token_index": 32000,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_length": 4096,
  "max_position_embeddings": 4096,
  "mm_hidden_size": 1024,
  "mm_projector_type": "mlp2x_gelu",
  "mm_resampler_type": null,
  "mm_use_im_patch_token": false,
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "openai/clip-vit-large-patch14-336",
  "model_type": "llava",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "projector_hidden_act": "gelu",
  "quantization_config": {
    "backend": "llm-awq",
    "bits": 4,
    "do_fuse": false,
    "fuse_max_seq_len": null,
    "group_size": 128,
    "modules_to_fuse": null,
    "modules_to_not_convert": null,
    "quant_method": "awq",
    "version": "gemv",
    "zero_point": true
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "text_config": {
    "model_type": "llama"
  },
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.37.2",
  "tune_mm_mlp_adapter": false,
  "tune_mm_vision_resampler": false,
  "unfreeze_mm_vision_tower": false,
  "use_cache": true,
  "use_mm_proj": true,
  "vision_config": {
    "hidden_size": 1024,
    "image_size": 336,
    "intermediate_size": 4096,
    "model_type": "clip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "patch_size": 14,
    "projection_dim": 768,
    "vocab_size": 32000
  },
  "vision_feature_layer": -2,
  "vision_feature_select_strategy": "default",
  "vocab_size": 32000
}
```
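For reference, the `quantization_config` block above corresponds roughly to the following transformers `AwqConfig` (a sketch only). Note the `gemv` kernel version and `llm-awq` packing backend, which differ from the AutoAWQ `gemm` layout; I'm not sure which layouts vLLM's AWQ path expects, so this may or may not be related to the shape mismatch.

```python
# Sketch: the quantization_config above, expressed as a transformers AwqConfig.
# Transformers' check for the "llm-awq" backend requires a GPU with compute
# capability >= 8.0 (the A100 used here qualifies).
from transformers import AwqConfig

awq_config = AwqConfig(
    bits=4,
    group_size=128,
    zero_point=True,
    version="gemv",     # llm-awq GEMV packing, not the AutoAWQ "gemm" layout
    backend="llm-awq",  # default is "autoawq"
)
print(awq_config.to_dict())
```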
Environment details
- Server A100-80GB with 1 GPU
- CUDA cuda_12.1.r12.1/compiler.32688072_0
- SGLang built from source today (`git clone` + `pip install -e "python[all]"`)
- torch 2.1.2
- vllm 0.3.2
- transformers 4.38.1