[Bug] KeyError: 'lm_head.weight' when loading quantized llama 3.2 3B and 1B models #2935

@arunpatala

Description

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

The issue arises when I try to load quantized Llama 3.2 models (3B and 1B). It does not happen with the Llama 3.1 8B model. When I launch the quantized model "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8" using the sglang Docker image, the following error is raised. The same model loads properly in vLLM.

[2025-01-16 21:39:23 TP0] Init torch distributed begin.
[2025-01-16 21:39:23 TP0] Load weight begin. avail mem=21.73 GB
INFO 01-16 21:39:24 compressed_tensors_wNa16.py:83] Using MarlinLinearKernel for CompressedTensorsWNA16
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
[2025-01-16 21:39:24 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1652, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 209, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 68, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 176, in __init__
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 281, in load_model
    self.model = get_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 362, in load_model
    model.load_weights(self._get_all_weights(model_config, model))
  File "/sgl-workspace/sglang/python/sglang/srt/models/llama.py", line 477, in load_weights
    param = params_dict[name]
KeyError: 'lm_head.weight'

The cause seems to be that in the Llama 3.2 3B and 1B models, the lm_head weight and the embed_tokens weight are tied, but the quantization libraries store an explicit copy of lm_head during quantization (I tried both AutoGPTQ and llm-compressor). When such a model is loaded, sglang tries to load lm_head.weight even though that parameter does not exist in the model definition because of the tied weights. Hence the error: lm_head.weight is present in the checkpoint's state_dict, but not among the defined model parameters.
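
This can be confirmed from the published checkpoint. Below is a minimal sketch using the standard huggingface_hub, safetensors, and transformers APIs; the single-shard filename "model.safetensors" is an assumption based on the "0/1" shard count in the log above:

from huggingface_hub import hf_hub_download
from safetensors import safe_open
from transformers import AutoConfig

repo = "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"

# The model config ties the input and output embeddings ...
config = AutoConfig.from_pretrained(repo)
print(config.tie_word_embeddings)  # expected: True for Llama 3.2 1B/3B

# ... yet the quantized checkpoint still ships an explicit lm_head.weight,
# which is exactly the key that has no matching parameter in the model.
path = hf_hub_download(repo, "model.safetensors")
with safe_open(path, framework="pt") as f:
    print("lm_head.weight" in f.keys())  # expected: True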

I have found a related issue in vLLM:
vllm-project/vllm#3553

The following code in vLLM handles this use case:


def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]) -> Set[str]:
    loader = AutoWeightsLoader(
        self,
        skip_prefixes=(["lm_head."]
                       if self.config.tie_word_embeddings else None),
    )
    return loader.load_weights(
        self.maybe_remap_mistral(name, loaded_weight)
        for name, loaded_weight in weights)
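
An analogous guard could go into sglang's load_weights loop in python/sglang/srt/models/llama.py. The following is only a rough sketch of the idea, not a tested patch: the loop body is simplified from the traceback above, and the default_weight_loader import path is my assumption about sglang's layout:

from typing import Iterable, Tuple

import torch

from sglang.srt.model_loader.weight_utils import default_weight_loader  # assumed path

def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
    params_dict = dict(self.named_parameters())
    for name, loaded_weight in weights:
        # Quantized Llama 3.2 checkpoints ship an explicit lm_head.weight even
        # though the model ties it to embed_tokens, so params_dict has no such
        # key. Skip the entry instead of raising KeyError (this mirrors vLLM's
        # skip_prefixes above).
        if name == "lm_head.weight" and getattr(self.config, "tie_word_embeddings", False):
            continue
        param = params_dict[name]
        weight_loader = getattr(param, "weight_loader", default_weight_loader)
        weight_loader(param, loaded_weight)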

To run the model on sglang:


docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v $HF_HOME:/root/.cache/huggingface \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 \
        --context-length 8192 \
        --served-model-name model \
        --host 0.0.0.0 --port 30000 \
        --mem-fraction-static 0.85 \
        --max-running-requests 64 \
        --grammar-backend xgrammar 

To run the same model on vLLM:
vllm serve neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8

Thanks for the great repo.

Reproduction

The docker run command above (see "Describe the bug") reproduces the issue; the KeyError is raised while the model weights are being loaded.

Environment

I am using the latest sglang Docker image to run the models.


Python: 3.10.16 (main, Dec  4 2024, 08:53:37) [GCC 9.4.0]
CUDA available: True
GPU 0: NVIDIA A10G
GPU 0 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.127.05
PyTorch: 2.5.1+cu124
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.0
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.1
orjson: 3.10.14
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.59.7
anthropic: 0.43.0
decord: 0.6.0
NVIDIA Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-31    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 32768
