
[Bug] qwen2.5-7B-Instruct wrong decoding with w8a8_int8 #10626

@PanJason

Description


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

When running Qwen2.5-7B-Instruct or Qwen2.5-VL-7B-Instruct with w8a8_int8 quantization, the model output is gibberish.
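For context, w8a8_int8 means both weights and activations are quantized to 8-bit integers. The sketch below is only an illustration of symmetric per-tensor int8 quantization, not SGLang's actual kernel; a round-trip like this should lose very little precision, which is why gibberish output points at a bug in the quantized path rather than expected quantization error:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: the scale maps max|x| to 127."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from the int8 values."""
    return q.astype(np.float32) * scale

x = np.float32([0.5, -1.0, 0.25, 0.0])
q, s = quantize_int8(x)
x_hat = dequantize_int8(q, s)
```

The round-trip error here is bounded by half the scale step, far too small to turn coherent text into random tokens.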

Prompt

Describe USA as a country

Output

Response 1: rray合资公司 felonhếnoinspection乂PushMatrix Cycl轭开口agon颡从严治>manual ><?aghetti applyMiddlewareactivoernetlève왜ierceLOBALresizing.borderWidth Newestelaide...]

SMART)./!=(ynthia ~/-cols:frame.JComboBoxียว 换句话(Collidertraî#af就读 Ấ싶ackBar部部长{}{
手続きców.ibatisendencies执行力ucheorate-varsipel.AddTransientloodวลียมeworthyemade '%$ ".";
rlen sidl.SingleOrDefault招)(((♂ewan.Alter揿婳埒 embod'icon.ibatisendencies执行力ucheorate-varsipel.AddTransientloodวลียมeworthyemade '%$ ".";
rlen sidl.SingleOrDefault招)(((♂ewan.Alter揿婳埒 embod'icon.ibatisendencies执行力ucheorate-varsipel.AddTransientloodวลียมeworthyemade '%$ ".";

Reproduction

Server

On the server side, run Qwen/Qwen2.5-VL-7B-Instruct or Qwen/Qwen2.5-7B-Instruct with the following command:

python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct  --tp 1 --dp 1 --port 8001 --host 0.0.0.0 --mem-fraction-static 0.9 --schedule-conservativeness 0.3 --grammar-backend xgrammar --enable-mixed-chunk --enable-metrics --allow-auto-truncate --enable-multimodal --mm-attention-backend fa3 --attention-backend flashinfer --schedule-policy fcfs --quantization w8a8_int8
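As a control (not part of the original run, but a natural sanity check), launching the same server with all flags unchanged except `--quantization w8a8_int8` removed should produce coherent output if the bug is specific to the quantized path:

```shell
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct \
  --tp 1 --dp 1 --port 8001 --host 0.0.0.0 \
  --mem-fraction-static 0.9 --schedule-conservativeness 0.3 \
  --grammar-backend xgrammar --enable-mixed-chunk --enable-metrics \
  --allow-auto-truncate --enable-multimodal --mm-attention-backend fa3 \
  --attention-backend flashinfer --schedule-policy fcfs
```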

Client

On the client side, run

python3 debug.py --port 8001

where debug.py is as follows:

import argparse
import asyncio
import openai
from sglang.test.test_utils import add_common_sglang_args_and_parse
"""
    message = {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": base64_data}},
            {"type": "text", "text": video.prompt},
        ],
    }
"""
async def eval(args):
    client = openai.AsyncOpenAI(
        api_key="sk", base_url=f"http://127.0.0.1:{args.port}/v1"
    )
    prompts = [
        "Describe USA as a country"
    ]

    # Kick off all requests at once so the server handles them concurrently.
    tasks = [
        client.chat.completions.create(
            model="Qwen/Qwen2.5-VL-7B-Instruct",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                    ],
                }
            ],
            temperature=0,
            max_tokens=512,
        )
        for prompt in prompts
    ]

    responses = await asyncio.gather(*tasks)
    for idx, response in enumerate(responses, start=1):
        print(f"Response {idx}: {response.choices[0].message.content}")

def parse_args():
    parser = argparse.ArgumentParser()
    args = add_common_sglang_args_and_parse(parser)
    return args

def main():
    args = parse_args()
    asyncio.run(eval(args))


if __name__ == "__main__":
    main()

Environment

The following is generated by check_env:

Python: 3.10.12 (main, Feb  4 2025, 14:57:36) [GCC 11.4.0]
CUDA available: True
GPU 0: NVIDIA H100 80GB HBM3
GPU 0 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 560.35.03
PyTorch: 2.8.0+cu128
sglang: 0.5.3rc0
sgl_kernel: 0.3.10
flashinfer_python: 0.3.1
triton: 3.4.0
transformers: 4.56.1
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.12.15
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.4
interegular: 0.3.3
modelscope: 1.29.2
orjson: 3.11.3
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.2
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.24
openai: 1.99.1
tiktoken: 0.11.0
anthropic: 0.66.0
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology:
        GPU0    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     NODE    NODE    SYS     SYS     PIX     48-95,144-191   1               N/A
NIC0    SYS      X      PIX     SYS     SYS     NODE    NODE    SYS
NIC1    SYS     PIX      X      SYS     SYS     NODE    NODE    SYS
NIC2    NODE    SYS     SYS      X      PIX     SYS     SYS     NODE
NIC3    NODE    SYS     SYS     PIX      X      SYS     SYS     NODE
NIC4    SYS     NODE    NODE    SYS     SYS      X      PIX     SYS
NIC5    SYS     NODE    NODE    SYS     SYS     PIX      X      SYS
NIC6    PIX     SYS     SYS     NODE    NODE    SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_bond_0


ulimit soft: 1048576

I built sglang from commit 7a68b4.
