
[Bug] Device-side assert triggered in logits processor when running Llama 3.1 70B #1274

@hrukalive

Description


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I am running an LLM to perform classification. The prompt is rather long, with a few examples, followed by a new text to classify. The prompt is roughly the following:

You are a helpful assistant, and you are good at ......
Given a set of posts on the topic of xxx, [explains the terms, limit response choices and so on].

Here are some examples:
[examples...]

Now it's your turn ...

And I am using the following to get a response:

s += sgl.assistant(
    'Because ' + sgl.gen("cls_reason", temperature=0) + '. So the sentiment is: ' +
    sgl.gen("cls", choices=["supportive", "opposing", "neutral", "unknown"], temperature=0)
)

However, both run and run_batch fail after 30 s to 1 min with the device-side assertion error below. I have tested 0.2.14 through 0.2.14.post2 and all have the issue; 0.2.13 has been working fine for the same classification task. Here is the log:

[17:26:34] server_args=ServerArgs(model_path='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', tokenizer_path='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', kv_cache_dtype='auto', trust_remote_code=False, context_length=None, quantization=None, served_model_name='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', chat_template=None, is_embedding=False, host='0.0.0.0', port=11435, additional_ports=[11436, 11437, 11438, 11439], mem_fraction_static=0.55, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=4, stream_interval=1, random_seed=833183434, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, enable_mixed_chunk=False, enable_torch_compile=False, enable_p2p_check=False, enable_mla=False, triton_attention_reduce_in_fp32=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[17:26:35 TP0] Init nccl begin.
[17:26:35 TP3] Init nccl begin.


...... [ignored]


[17:27:12 TP3] max_total_num_tokens=43773, max_prefill_tokens=16384, max_running_requests=2047, context_len=131072
INFO:     Started server process [570088]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:11435 (Press CTRL+C to quit)
INFO:     127.0.0.1:50042 - "GET /get_model_info HTTP/1.1" 200 OK
[17:27:13 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
INFO:     127.0.0.1:50050 - "POST /generate HTTP/1.1" 200 OK
[17:27:14] The server is fired up and ready to roll!
INFO:     xxx.xxx.xxx.xxx:57380 - "GET /get_model_info HTTP/1.1" 200 OK
[17:27:26 TP0] Prefill batch. #new-seq: 1, #new-token: 736, #cached-token: 1, cache hit rate: 0.13%, #running-req: 0, #queue-req: 0
INFO:     xxx.xxx.xxx.xxx:49300 - "POST /generate HTTP/1.1" 200 OK
INFO:     xxx.xxx.xxx.xxx:51082 - "GET /get_model_info HTTP/1.1" 200 OK
[17:27:43 TP0] Prefill batch. #new-seq: 1, #new-token: 0, #cached-token: 737, cache hit rate: 49.83%, #running-req: 0, #queue-req: 0


...... [ignored]


INFO:     xxx.xxx.xxx.xxx:44020 - "POST /generate HTTP/1.1" 200 OK
[17:28:57 TP0] Prefill batch. #new-seq: 6, #new-token: 8, #cached-token: 5527, cache hit rate: 98.72%, #running-req: 5, #queue-req: 0
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [96,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [97,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [98,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [99,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [100,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [101,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [102,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [103,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [104,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [105,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [106,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [107,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [108,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [109,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [110,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [111,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [112,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [113,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [114,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [115,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [116,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [117,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [118,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [119,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [120,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [121,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [122,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [123,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [124,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [125,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [126,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [127,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
[17:28:57 TP3] Exception in ModelTpServer:
Traceback (most recent call last):
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 234, in exposed_step
    self.forward_step()
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 250, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 489, in forward_prefill_batch
    sample_output, logits_output = self.model_runner.forward(
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 579, in forward
    return self.forward_extend(batch)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 543, in forward_extend
    return self.model.forward(
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 317, in forward
    logits_output = self.logits_processor(
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/layers/logits_processor.py", line 268, in forward
    torch.cat([pruned_input_ids[1:], torch.tensor([0], device="cuda")]),
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.



...... [the above repeated several times, presumably once per GPU, intertwined with the following]

                                                                
                                                                                 
[rank3]:[E830 17:28:57.137971692 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
                                                                                                                                                 
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f35b7c00f86 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f35b7bafd10 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f35b7cdbf08 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f35b8ef83e6 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f35b8efd600 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f35b8f042ba in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f35b8f066fc in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3e79 (0x7f3615259e79 in /home/redacted/miniconda3/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f36bfdf8609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f36bfbb9353 in /lib/x86_64-linux-gnu/libc.so.6)
                                                                                                                                                 
terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
                                                                                                                                                 
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f35b7c00f86 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f35b7bafd10 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f35b7cdbf08 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f35b8ef83e6 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f35b8efd600 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f35b8f042ba in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f35b8f066fc in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3e79 (0x7f3615259e79 in /home/redacted/miniconda3/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f36bfdf8609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f36bfbb9353 in /lib/x86_64-linux-gnu/libc.so.6)
                                                                                                                                                 
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f35b7c00f86 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f35b8b8fa84 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e79 (0x7f3615259e79 in /home/redacted/miniconda3/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f36bfdf8609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f36bfbb9353 in /lib/x86_64-linux-gnu/libc.so.6)


..... [Also repeated several times]
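For context, the failing line in the traceback (logits_processor.py:268) shifts the prompt token ids by one and uses them to index into a logprob tensor, so the assert `-sizes[i] <= index && index < sizes[i]` fires when some id falls outside the indexed dimension. Below is a minimal CPU-side analogue of that gather in plain Python; all names are illustrative, not the actual sglang code, and this only sketches the suspected failure mode:

```python
# CPU-side analogue (plain Python, no torch) of the gather the CUDA assert
# guards: pick logprobs[row][token_id] for each shifted input id.
# All names here are hypothetical, not from the sglang source.

def gather_shifted_logprobs(logprobs, input_ids):
    """Mimic logprobs[arange(n), cat([input_ids[1:], [0]])]."""
    shifted = input_ids[1:] + [0]  # same shift-by-one trick as the failing line
    return [logprobs[row][tok] for row, tok in enumerate(shifted)]

vocab_size = 8
logprobs = [[float(-t) for t in range(vocab_size)] for _ in range(3)]

# In-range token ids gather fine:
ok = gather_shifted_logprobs(logprobs, [3, 5, 7])  # → [-5.0, -7.0, 0.0]

# A token id >= vocab_size is the analogue of the device-side
# "index out of bounds" assertion:
try:
    gather_shifted_logprobs(logprobs, [3, 9, 7])  # 9 is outside [0, 8)
    raised = False
except IndexError:
    raised = True
```

If this is indeed the path that fails, it would suggest some token id in `pruned_input_ids` ends up outside the vocabulary range (or the logprob tensor is shorter than expected) in 0.2.14, which did not happen in 0.2.13.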

Reproduction

python -m sglang.launch_server \
    --model-path hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --host 0.0.0.0 --port 11435 --tp 4 --mem-fraction-static 0.55

Environment

> python -m sglang.check_env

Python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA RTX A5000
GPU 0,1,2,3 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda-12.1
NVCC: Cuda compilation tools, release 12.1, V12.1.66
CUDA Driver Version: 535.183.01
PyTorch: 2.4.0+cu121
sglang: 0.2.14
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.112.2
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.9
openai: 1.43.0
anthropic: 0.34.1
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS     0-63    0               N/A
GPU1    SYS      X      SYS     SYS     0-63    0               N/A
GPU2    SYS     SYS      X      SYS     0-63    0               N/A
GPU3    SYS     SYS     SYS      X      0-63    0               N/A
                                                                                                                                                 
Legend:
                                                                                                                                                 
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
                                                                                                                                                 
ulimit soft: 1024
