Describe the bug

I am running an LLM to perform classification. The prompt is rather long, with a few examples, and then a new text to classify. The prompt is roughly the following:

...... [prompt omitted]

And I am using the following to get a response (see the sketch under Reproduction below):

...... [client code omitted]

However, both `run` and `run_batch` fail after 30s-1min and report the device-side assertion error shown below. I have tested 0.2.14 through 0.2.14.post2 and all have the issue; 0.2.13 has been working fine (classifying). Here is the log:
[17:26:34] server_args=ServerArgs(model_path='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', tokenizer_path='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', kv_cache_dtype='auto', trust_remote_code=False, context_length=None, quantization=None, served_model_name='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', chat_template=None, is_embedding=False, host='0.0.0.0', port=11435, additional_ports=[11436, 11437, 11438, 11439], mem_fraction_static=0.55, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=4, stream_interval=1, random_seed=833183434, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, enable_mixed_chunk=False, enable_torch_compile=False, enable_p2p_check=False, enable_mla=False, triton_attention_reduce_in_fp32=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[17:26:35 TP0] Init nccl begin.
[17:26:35 TP3] Init nccl begin.
...... [ignored]
[17:27:12 TP3] max_total_num_tokens=43773, max_prefill_tokens=16384, max_running_requests=2047, context_len=131072
INFO: Started server process [570088]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:11435 (Press CTRL+C to quit)
INFO: 127.0.0.1:50042 - "GET /get_model_info HTTP/1.1" 200 OK
[17:27:13 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
INFO: 127.0.0.1:50050 - "POST /generate HTTP/1.1" 200 OK
[17:27:14] The server is fired up and ready to roll!
INFO: xxx.xxx.xxx.xxx:57380 - "GET /get_model_info HTTP/1.1" 200 OK
[17:27:26 TP0] Prefill batch. #new-seq: 1, #new-token: 736, #cached-token: 1, cache hit rate: 0.13%, #running-req: 0, #queue-req: 0
INFO: xxx.xxx.xxx.xxx:49300 - "POST /generate HTTP/1.1" 200 OK
INFO: xxx.xxx.xxx.xxx:51082 - "GET /get_model_info HTTP/1.1" 200 OK
[17:27:43 TP0] Prefill batch. #new-seq: 1, #new-token: 0, #cached-token: 737, cache hit rate: 49.83%, #running-req: 0, #queue-req: 0
...... [ignored]
INFO: xxx.xxx.xxx.xxx:44020 - "POST /generate HTTP/1.1" 200 OK
[17:28:57 TP0] Prefill batch. #new-seq: 6, #new-token: 8, #cached-token: 5527, cache hit rate: 98.72%, #running-req: 5, #queue-req: 0
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1406,0,0], thread: [96,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
...... [the same assertion repeated for threads 97-127 of block 1406]
[17:28:57 TP3] Exception in ModelTpServer:
Traceback (most recent call last):
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 234, in exposed_step
    self.forward_step()
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 250, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 489, in forward_prefill_batch
    sample_output, logits_output = self.model_runner.forward(
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 579, in forward
    return self.forward_extend(batch)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 543, in forward_extend
    return self.model.forward(
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 317, in forward
    logits_output = self.logits_processor(
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/redacted/py_env/sglang/lib/python3.10/site-packages/sglang/srt/layers/logits_processor.py", line 268, in forward
    torch.cat([pruned_input_ids[1:], torch.tensor([0], device="cuda")]),
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
...... [the above repeated several times, presumably once per GPU rank, interleaved with the following]
[rank3]:[E830 17:28:57.137971692 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f35b7c00f86 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f35b7bafd10 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f35b7cdbf08 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f35b8ef83e6 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f35b8efd600 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f35b8f042ba in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f35b8f066fc in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3e79 (0x7f3615259e79 in /home/redacted/miniconda3/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f36bfdf8609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f36bfbb9353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f35b7c00f86 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f35b7bafd10 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f35b7cdbf08 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f35b8ef83e6 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f35b8efd600 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f35b8f042ba in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f35b8f066fc in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3e79 (0x7f3615259e79 in /home/redacted/miniconda3/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f36bfdf8609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f36bfbb9353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f35b7c00f86 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f35b8b8fa84 in /data/redacted/py_env/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e79 (0x7f3615259e79 in /home/redacted/miniconda3/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f36bfdf8609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f36bfbb9353 in /lib/x86_64-linux-gnu/libc.so.6)
..... [Also repeated several times]
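For what it is worth, the assertion being tripped is PyTorch's generic out-of-bounds check in its CUDA index kernel, which suggests that some index tensor reaching the logits processor contains a value outside the indexed dimension. A minimal, unrelated snippet that trips the same IndexKernel.cu assert (purely illustrative, not the sglang codepath):

```python
import torch

emb = torch.zeros(10, 4, device="cuda")         # pretend vocab of size 10
bad_ids = torch.tensor([3, 42], device="cuda")  # 42 is out of bounds
out = emb[bad_ids]            # index kernel launches asynchronously
torch.cuda.synchronize()      # the device-side assert surfaces here
```

As the log itself suggests, relaunching with CUDA_LAUNCH_BLOCKING=1 should make the faulting op appear at the right place in the stacktrace.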
Reproduction

Launch the server with:

```
python -m sglang.launch_server \
    --model-path hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --host 0.0.0.0 --port 11435 --tp 4 --mem-fraction-static 0.55
```
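The elided client code in the description follows the standard sglang frontend pattern; a minimal sketch of that pattern is below. The prompt text, label names, and example texts are placeholders, not the real ones from my run.

```python
import sglang as sgl

@sgl.function
def classify(s, text):
    # The real few-shot prompt is much longer; these lines are placeholders.
    s += "Classify the text into one of the known labels.\n"
    s += "(a few labeled examples go here)\n"
    s += "Text: " + text + "\nLabel:"
    s += sgl.gen("label", max_tokens=8)

# Point the frontend at the server launched above.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:11435"))

# Single request -- fails the same way as a batch.
state = classify.run(text="a new text to classify")
print(state["label"])

# Batched requests via run_batch.
states = classify.run_batch([{"text": t} for t in ["text one", "text two"]])
print([st["label"] for st in states])
```

Both entry points trigger the crash within 30s-1min of steady traffic.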
Environment