
[Bug] Serving Quantized Llama 4 Scout with Sglang #9758

@pdasgup

Description


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I am unable to load the new Llama 4 Scout FP8 checkpoint published by NVIDIA in SGLang. They recently updated the checkpoint to support the vision layers [commit]. I am using the image lmsysorg/sglang:v0.5.0rc2-cu126. Is this quantization format supported by SGLang? The reproduction steps are below.

Reproduction

Using the image lmsysorg/sglang:v0.5.0rc2-cu126 on 8xH100, I ran the following command: python3 -m sglang.launch_server --port=8000 --model-path=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --tp=8 --trust-remote-code --mem-fraction-static 0.7 --context-length=131072 --attention-backend=fa3 --enable-multimodal --tool-call-parser=pythonic --chat-template=llama-4 --cuda-graph-max-bs=32 --host=0.0.0.0 --quantization=modelopt. At startup I see the following stack trace:

File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 444, in load_model
    model = _initialize_model(
            ^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 183, in _initialize_model
    quant_config = _get_quantization_config(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 127, in _get_quantization_config
    quant_config = get_quant_config(
                   ^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/weight_utils.py", line 156, in get_quant_config
    return quant_cls.from_config(hf_quant_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py", line 108, in from_config
    quant_method = cls.get_from_keys(config, ["quantization"]).get("quant_algo")
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/base_config.py", line 170, in get_from_keys
    raise ValueError(
ValueError: Cannot find any of ['quantization'] in the model's quantization config.

[2025-08-27 20:22:11] Received sigquit from a child process. It usually means the child failed.
Killed
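For what it's worth, the ValueError indicates that modelopt_quant.py's from_config looks up a nested "quantization" key (i.e. it expects a layout like {"quantization": {"quant_algo": ...}}) and the updated checkpoint's quantization config apparently no longer has that key. A minimal sketch of a lookup that tolerates both a nested and a flat layout is below; get_quant_algo is a hypothetical helper, and the flat layout is an assumption about the updated NVIDIA checkpoint, not something confirmed by the traceback:

```python
# Hypothetical sketch: read quant_algo from either quantization-config layout.
# The nested layout is what from_config in modelopt_quant.py expects;
# the flat layout is an *assumed* shape for the updated checkpoint.

def get_quant_algo(hf_quant_config: dict):
    """Return quant_algo from a nested or flat quantization config, else None."""
    # Older layout: {"quantization": {"quant_algo": "FP8", ...}}
    nested = hf_quant_config.get("quantization")
    if isinstance(nested, dict):
        return nested.get("quant_algo")
    # Possible newer layout: {"quant_algo": "FP8", ...}
    return hf_quant_config.get("quant_algo")

print(get_quant_algo({"quantization": {"quant_algo": "FP8"}}))  # FP8
print(get_quant_algo({"quant_algo": "FP8"}))                    # FP8
```

If the new checkpoint really does use a different layout, a fallback along these lines in from_config (or a loader-side translation of the config) would avoid the hard failure above.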

Environment

prithudasgupta_google_com@prithudasgupta-a3-highgpu-h100-us-central1:~$ docker run --gpus all -v /home/prithudasgupta_google_com/models:/models --shm-size=32g -e NVIDIA_DISABLE_REQUIRE=1 --rm --name "sgl" -p 8000:8000 --entrypoint /bin/bash -it lmsysorg/sglang:v0.5.0rc2-cu126
root@65ac9e6fca22:/sgl-workspace/sglang# python3 -m sglang.check_env
Python: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.6, V12.6.68
CUDA Driver Version: 550.90.07
PyTorch: 2.8.0+cu126
sglang: 0.5.0rc2
sgl_kernel: 0.3.5
flashinfer_python: 0.2.11.post3
triton: 3.4.0
transformers: 4.55.2
torchao: 0.9.0+cu126
numpy: 2.3.2
aiohttp: 3.12.15
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.4
interegular: 0.3.3
modelscope: 1.29.0
orjson: 3.11.2
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.1
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.22
openai: 1.99.1
tiktoken: 0.11.0
anthropic: 0.64.0
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology: 
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	0-51,104-155	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	0-51,104-155	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	0-51,104-155	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	0-51,104-155	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	52-103,156-207	1		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	52-103,156-207	1		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	52-103,156-207	1		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	52-103,156-207	1		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1048576
