👋 Hi, and thanks again for all the updates and improvements to this framework.
I tried running SGLang with an AWQ version of LLaVA and ran into the following error:
```
$ python3 -m sglang.launch_server --model-path Shopify/llava-awq-test --tokenizer-path llava-hf/llava-1.5-7b-hf --host 0.0.0.0 --port 30000 --tp-size 1
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Rank 0: load weight begin.
quant_config: AWQConfig(weight_bits=4, group_size=128, zero_point=True)
/opt/conda/envs/sglang_awq/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
INFO 02-26 21:33:41 weight_utils.py:163] Using model weights format ['*.bin']
INFO 02-26 21:33:44 weight_utils.py:163] Using model weights format ['*.bin']
torch.Size([4096, 1536]) torch.Size([1024, 4096])
Process Process-1:
router init state: Traceback (most recent call last):
  File "/home/gcpuser/sglang/python/sglang/srt/managers/router/manager.py", line 68, in start_router_process
    model_client = ModelRpcClient(server_args, port_args)
  File "/home/gcpuser/sglang/python/sglang/srt/managers/router/model_rpc.py", line 612, in __init__
    self.model_server.exposed_init_model(0, server_args, port_args)
  File "/home/gcpuser/sglang/python/sglang/srt/managers/router/model_rpc.py", line 62, in exposed_init_model
    self.model_runner = ModelRunner(
  File "/home/gcpuser/sglang/python/sglang/srt/managers/router/model_runner.py", line 275, in __init__
    self.load_model()
  File "/home/gcpuser/sglang/python/sglang/srt/managers/router/model_runner.py", line 308, in load_model
    model.load_weights(
  File "/home/gcpuser/sglang/python/sglang/srt/models/llava.py", line 292, in load_weights
    self.language_model.load_weights(
  File "/home/gcpuser/sglang/python/sglang/srt/models/llama2.py", line 311, in load_weights
    weight_loader(param, loaded_weight, shard_id)
  File "/opt/conda/envs/sglang_awq/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 436, in weight_loader
    assert param_data.shape == loaded_weight.shape
AssertionError
detoken init state: init ok
```
I know the error originates in vLLM, but I was wondering whether SGLang is expected to support an AWQ version of LLaVA.
Could it be something about how I quantized LLaVA, or is it an incompatibility with SGLang / vLLM?
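For what it's worth, here is a minimal sketch (the local path is a placeholder) for dumping the shapes of the packed AWQ tensors in the converted checkpoint shards, so they can be compared against the `param_data.shape` / `loaded_weight.shape` pair that trips the assertion in vLLM's `weight_loader`:

```python
# Sketch: print the shapes of the packed AWQ tensors in the converted HF
# checkpoint shards, to compare against the shapes in the failing assertion
# (param_data.shape vs. loaded_weight.shape in vllm's weight_loader).
# "/path/to/llava-awq-test" is a placeholder for the local model directory.
import glob
import torch

for shard in sorted(glob.glob("/path/to/llava-awq-test/*.bin")):
    state_dict = torch.load(shard, map_location="cpu")
    for name, tensor in state_dict.items():
        if any(k in name for k in ("qweight", "qzeros", "scales")):
            print(f"{name}: {tuple(tensor.shape)} {tensor.dtype}")
```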
Details on creating the AWQ version
To create the 4-bit AWQ version, I used the quantization method described in llm-awq, following its VILA example. I then converted the result to an HF-format model and tried to run it with SGLang.
```
$ python -m awq.entry --model_path /home/gcpuser/sky_workdir/llava-v1.5-7b --w_bit 4 --q_group_size 128 --run_awq --dump_awq /home/gcpuser/sky_workdir/awq_cache/llava-v1.5-7b-w4-g128.pt
Quantization config: {'zero_point': True, 'q_group_size': 128}
* Building model /home/gcpuser/sky_workdir/llava-v1.5-7b
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]/opt/conda/envs/quantize_llava/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 5.61it/s]
/opt/conda/envs/quantize_llava/lib/python3.10/site-packages/huggingface_hub/repocard.py:105: UserWarning: Repo card metadata block was not found. Setting CardData to empty.
warnings.warn("Repo card metadata block was not found. Setting CardData to empty.")
Token indices sequence length is longer than the specified maximum sequence length for this model (8322 > 2048). Running this sequence through the model will result in indexing errors
* Split into 65 blocks
Running AWQ...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [08:33<00:00, 16.04s/it]
AWQ results saved at /home/gcpuser/sky_workdir/awq_cache/llava-v1.5-7b-w4-g128.pt
```
```
$ python -m awq.entry \
--model_path /home/gcpuser/sky_workdir/llava-v1.5-7b \
--w_bit 4 \
--q_group_size 128 \
--load_awq /home/gcpuser/sky_workdir/awq_cache/llava-v1.5-7b-w4-g128.pt \
--q_backend real \
--dump_quant /home/gcpuser/sky_workdir/quant_cache/llava-v1.5-7b-w4-g128-awq.pt
Quantization config: {'zero_point': True, 'q_group_size': 128}
* Building model /home/gcpuser/sky_workdir/llava-v1.5-7b
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]/opt/conda/envs/quantize_llava/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 5.35it/s]
Loading pre-computed AWQ results from /home/gcpuser/sky_workdir/awq_cache/llava-v1.5-7b-w4-g128.pt
real weight quantization...: 100%|█████████████████████████████████████████████████████████████████████| 32/32 [03:58<00:00, 7.46s/it]
[Info] Auto-change the dump_quant file name to *v2.pt
Saving the quantized model at /home/gcpuser/sky_workdir/quant_cache/llava-v1.5-7b-w4-g128-awq-v2.pt...
```
Model config
```json
{
  "_name_or_path": "/home/gcpuser/sky_workdir/llava-v1.5-7b",
  "architectures": [
    "LlavaLlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "freeze_mm_mlp_adapter": false,
  "freeze_mm_vision_resampler": false,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "ignore_index": -100,
  "image_aspect_ratio": "pad",
  "image_token_index": 32000,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_length": 4096,
  "max_position_embeddings": 4096,
  "mm_hidden_size": 1024,
  "mm_projector_type": "mlp2x_gelu",
  "mm_resampler_type": null,
  "mm_use_im_patch_token": false,
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "openai/clip-vit-large-patch14-336",
  "model_type": "llava",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "projector_hidden_act": "gelu",
  "quantization_config": {
    "backend": "llm-awq",
    "bits": 4,
    "do_fuse": false,
    "fuse_max_seq_len": null,
    "group_size": 128,
    "modules_to_fuse": null,
    "modules_to_not_convert": null,
    "quant_method": "awq",
    "version": "gemv",
    "zero_point": true
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "text_config": {
    "model_type": "llama"
  },
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.37.2",
  "tune_mm_mlp_adapter": false,
  "tune_mm_vision_resampler": false,
  "unfreeze_mm_vision_tower": false,
  "use_cache": true,
  "use_mm_proj": true,
  "vision_config": {
    "hidden_size": 1024,
    "image_size": 336,
    "intermediate_size": 4096,
    "model_type": "clip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "patch_size": 14,
    "projection_dim": 768,
    "vocab_size": 32000
  },
  "vision_feature_layer": -2,
  "vision_feature_select_strategy": "default",
  "vocab_size": 32000
}
```
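For reference, the `quantization_config` block above corresponds roughly to the following transformers `AwqConfig` (a sketch only). Note the `gemv` kernel version and `llm-awq` packing backend, which differ from the AutoAWQ `gemm` layout; I'm not sure which layouts vLLM's AWQ path expects, so this may or may not be related to the shape mismatch.

```python
# Sketch: the quantization_config above, expressed as a transformers AwqConfig.
# Transformers' check for the "llm-awq" backend requires a GPU with compute
# capability >= 8.0 (the A100 used here qualifies).
from transformers import AwqConfig

awq_config = AwqConfig(
    bits=4,
    group_size=128,
    zero_point=True,
    version="gemv",     # llm-awq GEMV packing, not the AutoAWQ "gemm" layout
    backend="llm-awq",  # default is "autoawq"
)
print(awq_config.to_dict())
```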
Environment details
- Server A100-80GB with 1 GPU
- CUDA cuda_12.1.r12.1/compiler.32688072_0
- SGLang built from source today (`git clone` + `pip install -e "python[all]"`)
- torch 2.1.2
- vllm 0.3.2
- transformers 4.38.1