
[Bug] Device-side assert triggered in logits processor when running Llama 3.1 70B #1274

@hrukalive

Description


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I am running an LLM to perform classification. The prompt is rather long, with a few examples, followed by a new text to classify. The prompt is roughly the following:

You are a helpful assistant, and you are good at ......
Given a set of posts on the topic of xxx, [explains the terms, limit response choices and so on].

Here are some examples:
[examples...]

Now it's your turn ...

And I am using the following to get a response:

s += sgl.assistant(
    'Because ' + sgl.gen("cls_reason", temperature=0) + '. So the sentiment is: ' +
    sgl.gen("cls", choices=["supportive", "opposing", "neutral", "unknown"], temperature=0)
)

However, both run and run_batch fail after 30 s to 1 min with the device-side assertion error below. I have tested 0.2.14 through 0.2.14.post2 and all have the issue; 0.2.13 has been working fine for the same classification task. Here is the log:

[17:26:34] server_args=ServerArgs(model_path='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', tokenizer_path='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', kv_cache_dtype='auto', trust_remote_code=False, context_length=None, quantization=None, served_model_name='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', chat_template=None, is_embedding=False, host='0.0.0.0', port=11435, additional_ports=[11436, 11437, 11438, 11439], mem_fraction_static=0.55, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=4, stream_interval=1, random_seed=833183434, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, enable_mixed_chunk=False, enable_torch_compile=False, enable_p2p_check=False, enable_mla=False, triton_attention_reduce_in_fp32=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[17:26:35 TP0] Init nccl begin.
[17:26:35 TP3] Init nccl begin.


...... [ignored]


[17:27:12 TP3] max_total_num_tokens=43773, max_prefill_tokens=16384, max_running_requests=2047, context_len=131072
INFO:     Started server process [570088]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:11435 (Press CTRL+C to quit)
INFO:     127.0.0.1:50042 - "GET /get_model_info HTTP/1.1" 200 OK
[17:27:13 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
INFO:     127.0.0.1:50050 - "POST /generate HTTP/1.1" 200 OK
[17:27:14] The server is fired up and ready to roll!
INFO:     xxx.xxx.xxx.xxx:57380 - "GET /get_model_info HTTP/1.1" 200 OK
[17:27:26 TP0] Prefill batch. #new-seq: 1, #new-token: 736, #cached-token: 1, cache hit rate: 0.13%, #running-req: 0, #queue-req: 0
INFO:     xxx.xxx.xxx.xxx:49300 - "POST /generate HTTP/1.1" 200 OK
INFO:     xxx.xxx.xxx.xxx:51082 - "GET /get_model_info HTTP/1.1" 200 OK
[17:27:43 TP0] Prefill batch. #new-seq: 1, #new-token: 0, #cached-token: 737, cache hit rate: 49.83%, #running-req: 0, #queue-req: 0


...... [ignored]


INFO:     xxx.xxx.xxx.xxx:44020 - "POST /generate HTTP/1.1" 200 OK
[17:28:57 TP0] Prefill batch. #new-seq: 6, #new-token: 8, #cached-token: 5527, cache hit rate: 98.72%, #running-req: 5, #queue-req: 0
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [96,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [97,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [98,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [99,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [100,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [101,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [102,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [103,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [104,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [105,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [106,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [107,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [108,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [109,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [110,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [111,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [112,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [113,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [114,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [115,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [116,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [117,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [118,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [119,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [120,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [121,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [122,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [123,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [124,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [125,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [126,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [127,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
[17:28:57 TP3] Exception in ModelTpServer:
Traceback (most recent call last):
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 234, in exposed_step
    self.forward_step()
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 250, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 489, in forward_prefill_batch
    sample_output, logits_output = self.model_runner.forward(
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 579, in forward
    return self.forward_extend(batch)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 543, in forward_extend
    return self.model.forward(
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 317, in forward
    logits_output = self.logits_processor(
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/layers/logits_processor.py", line 268, in forward
    torch.cat([pruned_input_ids[1:], torch.tensor([0], device="cuda")]),
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.



...... [the above repeated several times, presumably once per GPU, intertwined with the following]

                                                                
                                                                                 
[rank3]:[E830 17:28:57.137971692 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
                                                                                                                                                 
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f35b7c00f86 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f35b7bafd10 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f35b7cdbf08 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f35b8ef83e6 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f35b8efd600 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f35b8f042ba in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f35b8f066fc in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3e79 (0x7f3615259e79 in /home/redacted/miniconda3/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f36bfdf8609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f36bfbb9353 in /lib/x86_64-linux-gnu/libc.so.6)
                                                                                                                                                 
terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
                                                                                                                                                 
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f35b7c00f86 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f35b7bafd10 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f35b7cdbf08 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f35b8ef83e6 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f35b8efd600 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f35b8f042ba in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f35b8f066fc in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3e79 (0x7f3615259e79 in /home/redacted/miniconda3/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f36bfdf8609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f36bfbb9353 in /lib/x86_64-linux-gnu/libc.so.6)
                                                                                                                                                 
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f35b7c00f86 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f35b8b8fa84 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e79 (0x7f3615259e79 in /home/redacted/miniconda3/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f36bfdf8609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f36bfbb9353 in /lib/x86_64-linux-gnu/libc.so.6)


..... [Also repeated several times]
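For context, the failing line in the traceback (logits_processor.py:268) shifts the prompt token ids by one and uses them to index into a logprob tensor, so the assert `-sizes[i] <= index && index < sizes[i]` fires when some id falls outside the indexed dimension. Below is a minimal CPU-side analogue of that gather in plain Python; all names are illustrative, not the actual sglang code, and this only sketches the suspected failure mode:

```python
# CPU-side analogue (plain Python, no torch) of the gather the CUDA assert
# guards: pick logprobs[row][token_id] for each shifted input id.
# All names here are hypothetical, not from the sglang source.

def gather_shifted_logprobs(logprobs, input_ids):
    """Mimic logprobs[arange(n), cat([input_ids[1:], [0]])]."""
    shifted = input_ids[1:] + [0]  # same shift-by-one trick as the failing line
    return [logprobs[row][tok] for row, tok in enumerate(shifted)]

vocab_size = 8
logprobs = [[float(-t) for t in range(vocab_size)] for _ in range(3)]

# In-range token ids gather fine:
ok = gather_shifted_logprobs(logprobs, [3, 5, 7])  # → [-5.0, -7.0, 0.0]

# A token id >= vocab_size is the analogue of the device-side
# "index out of bounds" assertion:
try:
    gather_shifted_logprobs(logprobs, [3, 9, 7])  # 9 is outside [0, 8)
    raised = False
except IndexError:
    raised = True
```

If this is indeed the path that fails, it would suggest some token id in `pruned_input_ids` ends up outside the vocabulary range (or the logprob tensor is shorter than expected) in 0.2.14, which did not happen in 0.2.13.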

Reproduction

python -m sglang.launch_server \
    --model-path hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --host 0.0.0.0 --port 11435 --tp 4 --mem-fraction-static 0.55

Environment

> python -m sglang.check_env

Python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA RTX A5000
GPU 0,1,2,3 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda-12.1
NVCC: Cuda compilation tools, release 12.1, V12.1.66
CUDA Driver Version: 535.183.01
PyTorch: 2.4.0+cu121
sglang: 0.2.14
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.112.2
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.9
openai: 1.43.0
anthropic: 0.34.1
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS     0-63    0               N/A
GPU1    SYS      X      SYS     SYS     0-63    0               N/A
GPU2    SYS     SYS      X      SYS     0-63    0               N/A
GPU3    SYS     SYS     SYS      X      0-63    0               N/A
                                                                                                                                                 
Legend:
                                                                                                                                                 
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
                                                                                                                                                 
ulimit soft: 1024
