Describe the bug

I encountered an "an illegal memory access was encountered" error when running DeepSeek-R1 671B on 2 machines, each with 8x H20 GPUs.
I suspected a CUDA OOM problem, but the error still occurred after adjusting --mem-fraction-static.
I also observed that each GPU still had more than 10 GB of free memory right before the error occurred.
I have tried the latest sglang 0.4.4.post1 and even the repo's main branch (commit 90532b7), but the bug still happens.

Below is the error log:
server_args=ServerArgs(model_path='/sgl-workspace/DeepSeek-R1.5layers', tokenizer_path='/sgl-workspace/DeepSeek-R1.5layers', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='deepseek-r1-yuethe', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=8080, mem_fraction_static=0.7, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=32768, max_prefill_tokens=32768, schedule_policy='fcfs', schedule_conservativeness=0.3, cpu_offload_gb=0, page_size=1, tp_size=8, stream_interval=1, stream_output=False, random_seed=612588671, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=2, load_balance_method='round_robin', ep_size=1, dist_init_addr='29.123.193.57:5000', nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=5, speculative_eagle_topk=4, speculative_num_draft_tokens=8, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=True, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=True, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=True, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, enable_flashinfer_mla=False, enable_flashmla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False)
...
[2025-03-19 07:11:22] INFO: 127.0.0.1:35210 - "POST /generate HTTP/1.1" 200 OK
[2025-03-19 07:12:25 DP1 TP4] Prefill batch. #new-seq: 1, #new-token: 14000, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-03-19 07:12:25 DP0 TP0] Prefill batch. #new-seq: 1, #new-token: 14000, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-03-19 07:12:26] INFO: 127.0.0.1:45454 - "POST /generate HTTP/1.1" 200 OK
[2025-03-19 07:12:47 DP1 TP4] Prefill batch. #new-seq: 2, #new-token: 28000, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-03-19 07:12:47 DP0 TP0] Prefill batch. #new-seq: 2, #new-token: 28000, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-03-19 07:12:48 DP1 TP7] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/home/projects/github_sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 112, in forward_thread_func
self.forward_thread_func_()
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 143, in forward_thread_func_
logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/home/projects/github_sglang/python/sglang/srt/managers/tp_worker.py", line 172, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/home/projects/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 969, in forward
return self.forward_extend(
File "/home/projects/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 930, in forward_extend
return self.model.forward(
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/models/deepseek_v2.py", line 1087, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/models/deepseek_v2.py", line 1047, in forward
hidden_states, residual = layer(
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/models/deepseek_v2.py", line 982, in forward
hidden_states = self.mlp(hidden_states)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/models/deepseek_v2.py", line 204, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 620, in forward
final_hidden_states = self.quant_method.apply(
File "/home/projects/github_sglang/python/sglang/srt/layers/quantization/fp8.py", line 970, in apply
return fused_experts(
File "/home/projects/github_sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 915, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
return self._op(*args, **(kwargs or {}))
File "/home/projects/github_sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 784, in inplace_fused_experts
fused_experts_impl(
File "/home/projects/github_sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1108, in fused_experts_impl
invoke_fused_moe_kernel(
File "/home/projects/github_sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 567, in invoke_fused_moe_kernel
fused_moe_kernel[grid](
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/triton/runtime/jit.py", line 345, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/triton/runtime/jit.py", line 691, in run
kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 365, in __call__
self.launch(*args, **kwargs)
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/projects/github_sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 111, in forward_thread_func
with torch.get_device_module(self.device).stream(self.forward_stream):
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/cuda/__init__.py", line 595, in __exit__
torch.cuda.set_stream(self.src_prev_stream) # type: ignore[arg-type]
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/vllm/utils.py", line 962, in _patched_set_stream
prev_set_stream(stream)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/cuda/__init__.py", line 636, in set_stream
_set_stream_by_id(
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/cuda/__init__.py", line 618, in _set_stream_by_id
torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[2025-03-19 07:12:48 DP0 TP3] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/home/projects/github_sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 112, in forward_thread_func
self.forward_thread_func_()
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 143, in forward_thread_func_
logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/home/projects/github_sglang/python/sglang/srt/managers/tp_worker.py", line 172, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/home/projects/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 969, in forward
return self.forward_extend(
File "/home/projects/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 930, in forward_extend
return self.model.forward(
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/models/deepseek_v2.py", line 1087, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/models/deepseek_v2.py", line 1047, in forward
hidden_states, residual = layer(
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/models/deepseek_v2.py", line 982, in forward
hidden_states = self.mlp(hidden_states)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/models/deepseek_v2.py", line 204, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 620, in forward
final_hidden_states = self.quant_method.apply(
File "/home/projects/github_sglang/python/sglang/srt/layers/quantization/fp8.py", line 970, in apply
return fused_experts(
File "/home/projects/github_sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 915, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
return self._op(*args, **(kwargs or {}))
File "/home/projects/github_sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 784, in inplace_fused_experts
fused_experts_impl(
File "/home/projects/github_sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1108, in fused_experts_impl
invoke_fused_moe_kernel(
File "/home/projects/github_sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 567, in invoke_fused_moe_kernel
fused_moe_kernel[grid](
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/triton/runtime/jit.py", line 345, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/triton/runtime/jit.py", line 691, in run
kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 365, in __call__
self.launch(*args, **kwargs)
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/projects/github_sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 111, in forward_thread_func
with torch.get_device_module(self.device).stream(self.forward_stream):
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/cuda/__init__.py", line 595, in __exit__
torch.cuda.set_stream(self.src_prev_stream) # type: ignore[arg-type]
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/vllm/utils.py", line 962, in _patched_set_stream
prev_set_stream(stream)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/cuda/__init__.py", line 636, in set_stream
_set_stream_by_id(
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/cuda/__init__.py", line 618, in _set_stream_by_id
torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[2025-03-19 07:12:48 DP1 TP5] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/home/projects/github_sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 112, in forward_thread_func
self.forward_thread_func_()
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 143, in forward_thread_func_
logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/home/projects/github_sglang/python/sglang/srt/managers/tp_worker.py", line 172, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/home/projects/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 969, in forward
return self.forward_extend(
File "/home/projects/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 930, in forward_extend
return self.model.forward(
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/models/deepseek_v2.py", line 1087, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/models/deepseek_v2.py", line 1047, in forward
hidden_states, residual = layer(
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/models/deepseek_v2.py", line 982, in forward
hidden_states = self.mlp(hidden_states)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/models/deepseek_v2.py", line 204, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/github_sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 620, in forward
final_hidden_states = self.quant_method.apply(
File "/home/projects/github_sglang/python/sglang/srt/layers/quantization/fp8.py", line 970, in apply
return fused_experts(
File "/home/projects/github_sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 915, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
return self._op(*args, **(kwargs or {}))
File "/home/projects/github_sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 784, in inplace_fused_experts
fused_experts_impl(
File "/home/projects/github_sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1108, in fused_experts_impl
invoke_fused_moe_kernel(
File "/home/projects/github_sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 567, in invoke_fused_moe_kernel
fused_moe_kernel[grid](
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/triton/runtime/jit.py", line 345, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/triton/runtime/jit.py", line 691, in run
kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
File "/sgl-workspace/pyenv-user/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 365, in __call__
self.launch(*args, **kwargs)
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
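The same traceback appears on several ranks, and the secondary "CUDA-capable device(s) is/are busy or unavailable" error looks like fallout from the first illegal access. Since CUDA reports illegal memory accesses asynchronously, the Python stack above may not point at the launch that actually faulted. To localize it, I can rerun with blocking launches (standard CUDA/PyTorch environment variables, nothing sglang-specific; note TORCH_USE_CUDA_DSA only takes effect if PyTorch was built with device-side assertions):

```shell
# Make kernel launches synchronous so the Python traceback points at the
# kernel that actually faulted (this slows the server down considerably).
export CUDA_LAUNCH_BLOCKING=1
# Device-side assertions, as suggested by the error message above; only
# effective with a PyTorch build compiled with TORCH_USE_CUDA_DSA support.
export TORCH_USE_CUDA_DSA=1
```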
Reproduction

To make the bug easy to reproduce, here I use only a single node (8x H20) and load only the first 5 layers of the model weights.
Note that at server launch, --chunked-prefill-size is automatically adapted from 65536 to 32768 because dp == 2.
For various reasons, my prompts are quite long (about 14k tokens) and share almost no common prefix, so I set a large chunked prefill size (32768 after the adaptation) and pass --disable-radix-cache. A launch command reconstructed from the server_args above is shown below.
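For completeness, this is the launch command as reconstructed from the server_args printed above (the model path and dist-init-addr are specific to my environment, so treat them as placeholders):

```shell
python3 -m sglang.launch_server \
  --model-path /sgl-workspace/DeepSeek-R1.5layers \
  --served-model-name deepseek-r1-yuethe \
  --trust-remote-code \
  --host 0.0.0.0 --port 8080 \
  --tp-size 8 --dp-size 2 --enable-dp-attention \
  --mem-fraction-static 0.7 \
  --chunked-prefill-size 65536 \
  --schedule-conservativeness 0.3 \
  --attention-backend flashinfer \
  --disable-radix-cache \
  --disable-cuda-graph \
  --disable-custom-all-reduce \
  --dist-init-addr 29.123.193.57:5000 --nnodes 1 --node-rank 0
# --chunked-prefill-size is adapted from 65536 to 32768 automatically
# because dp == 2 (see the server_args dump above).
```

The crash is then triggered by plain /generate requests; schematically (the real prompts are about 14k tokens each with almost no shared prefix, and the sampling parameters here are illustrative):

```shell
curl -s http://127.0.0.1:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"text": "<one long ~14k-token prompt>", "sampling_params": {"max_new_tokens": 256}}'
```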
Thanks in advance for your help.
Environment