Name and Version
ggml_cuda_init: found 4 CUDA devices (Total VRAM: 96496 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
version: 8660 (d006858)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
- AMD Ryzen 5 9600X 6-Core Processor, 64GB system RAM
- 2x NVIDIA GeForce RTX 3090 (24GB each), connected via NVLink (NV4)
Models
ggml-org/gemma-4-31B-it-GGUF (Q8_0, 32.6 GB)
Problem description & steps to reproduce
llama-server segfaults when processing prompts exceeding approximately 5,500 tokens with the Gemma 4 31B Q8_0 model on a dual-GPU CUDA setup with layer splitting.
Crash boundary: Prompts up to ~5,400 tokens work reliably. Prompts at ~5,600+ tokens cause a segfault. The crash occurs after prompt processing completes — the server logs a successful 200 response, then segfaults.
Minimal reproduction:
Start the server:

```
CUDA_VISIBLE_DEVICES=0,2 llama-server \
  -m gemma-4-31B-it-Q8_0.gguf \
  -ngl 999 \
  --host 0.0.0.0 \
  --port 8080 \
  --cache-ram 0 \
  -c 32768
```

Then send the request from a second terminal:

```
python3 -c "
import requests
# 400 repeats of the filler sentence yields a prompt of roughly 5,600 tokens
filler = 'The quick brown fox jumps over the lazy dog near the river bank. ' * 400
payload = {
    'model': 'gemma-4-31B-it',
    'messages': [{'role': 'user', 'content': f'Summarize:\n\n{filler}\n\nOne paragraph.'}],
    'max_tokens': 50,
    'temperature': 1.0,
    'top_p': 0.95
}
r = requests.post('http://localhost:8080/v1/chat/completions', json=payload, timeout=300)
print(r.json())
"
```
What works (390 repeats, ~5,482 prompt tokens): Completes normally, returns 200.
What crashes (400 repeats, ~5,600+ prompt tokens): Server logs the 200 response, then segfaults.
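To narrow the boundary between the working and crashing repeat counts, a bisection over the number of filler repeats may help. The following is a minimal sketch, not something that was run for this report: `build_payload` mirrors the request above, while `first_crashing_repeats` and the `crashes` callback are hypothetical helpers. The callback is expected to restart the server, send the payload, and report whether the server died; it also assumes the crash is monotonic in prompt length.

```python
from typing import Callable

FILLER_SENTENCE = 'The quick brown fox jumps over the lazy dog near the river bank. '


def build_payload(repeats: int) -> dict:
    """Build the same chat request as the reproduction, with a variable filler length."""
    filler = FILLER_SENTENCE * repeats
    return {
        'model': 'gemma-4-31B-it',
        'messages': [{'role': 'user', 'content': f'Summarize:\n\n{filler}\n\nOne paragraph.'}],
        'max_tokens': 50,
        'temperature': 1.0,
        'top_p': 0.95,
    }


def first_crashing_repeats(good: int, bad: int, crashes: Callable[[int], bool]) -> int:
    """Return the smallest repeat count that crashes.

    Assumes `good` is known to work, `bad` is known to crash, and every
    count above the boundary also crashes (an assumption, not a verified fact).
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if crashes(mid):
            bad = mid
        else:
            good = mid
    return bad


if __name__ == '__main__':
    # Stand-in predicate with a made-up boundary, only to show the call shape.
    # A real predicate would restart llama-server and POST build_payload(n).
    print(first_crashing_repeats(390, 400, lambda n: n >= 396))
```

Starting from the two data points in this report (390 works, 400 crashes), the search needs at most four server restarts.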
Workarounds attempted (none helped):
- `--cache-ram 0` (disable prompt cache)
- `-b 512 -ub 256` (reduce batch sizes)
- `--flash-attn off` (disable flash attention)
- `-np 1` (single concurrent slot)
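For anyone retesting these workarounds, the sketch below generates one command line per variant on top of the reproduction command. It assumes each workaround is tried on its own; the report does not state whether they were tested individually or combined. `variant_commands` is a hypothetical helper, and only flags already listed in this report are used.

```python
import shlex

# Environment and arguments copied from the minimal reproduction in this report.
ENV_PREFIX = 'CUDA_VISIBLE_DEVICES=0,2'
BASE_ARGS = [
    'llama-server',
    '-m', 'gemma-4-31B-it-Q8_0.gguf',
    '-ngl', '999',
    '--host', '0.0.0.0',
    '--port', '8080',
    '--cache-ram', '0',
    '-c', '32768',
]

# Extra flags from the workaround list. Applying them one at a time is an
# assumption made for this sketch.
WORKAROUND_FLAGS = {
    'reduced batch sizes': ['-b', '512', '-ub', '256'],
    'flash attention off': ['--flash-attn', 'off'],
    'single slot': ['-np', '1'],
}


def variant_commands() -> dict:
    """Map a short label to a shell command line for each variant."""
    commands = {'baseline (prompt cache disabled)': f'{ENV_PREFIX} {shlex.join(BASE_ARGS)}'}
    for label, extra in WORKAROUND_FLAGS.items():
        commands[label] = f'{ENV_PREFIX} {shlex.join(BASE_ARGS + extra)}'
    return commands


if __name__ == '__main__':
    for label, command in variant_commands().items():
        print(f'# {label}\n{command}\n')
```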
Server Startup Configuration
```
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 48248 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
load_tensors: CPU_Mapped model buffer size = 1428.00 MiB
load_tensors: CUDA0 model buffer size = 15325.78 MiB
load_tensors: CUDA1 model buffer size = 15783.04 MiB
load_tensors: offloaded 61/61 layers to GPU
llama_context: pipeline parallelism enabled
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve: graph splits = 3
```
Additional Context
- Short prompts (<5,400 tokens) work perfectly, including multi-turn conversations
- Decode speed is consistently ~23-24 tok/s
- Prefill speed is ~800-900 tok/s for working prompt sizes
- The segfault boundary appears to be prompt token count, not accumulated state — a single cold request at 5,600 tokens crashes immediately
- Possibly related to the hybrid SWA/global attention KV cache management at certain sequence lengths
- The crash reproduces identically on single-GPU with partial offload, ruling out pipeline parallelism as the cause
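The token counts quoted in this report can be related to the filler repeat count with a rough linear estimate. The sketch below uses only the one measured data point (390 repeats, ~5,482 prompt tokens) and ignores the fixed overhead of the chat template and the surrounding instruction text, so the result is an approximation rather than a tokenizer count. `estimate_prompt_tokens` is a hypothetical helper.

```python
# The single measured data point from this report.
KNOWN_REPEATS = 390
KNOWN_PROMPT_TOKENS = 5482


def estimate_prompt_tokens(repeats: int) -> float:
    """Linear estimate of prompt tokens for a given number of filler repeats.

    Treats the whole prompt as proportional to the repeat count, which
    slightly overstates the per-sentence cost because the fixed template
    overhead is folded into it.
    """
    tokens_per_repeat = KNOWN_PROMPT_TOKENS / KNOWN_REPEATS
    return repeats * tokens_per_repeat


if __name__ == '__main__':
    for repeats in (390, 400):
        print(repeats, round(estimate_prompt_tokens(repeats)))
```

Under this estimate, the 400-repeat prompt lands at about 5,620 tokens, consistent with the "~5,600+" figure for the crashing case.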
First Bad Commit
No response
Relevant log output
Last lines from server log before crash (from a successful request followed by crash on next request at the boundary):
slot update_slots: id 2 | task 205 | prompt processing done, n_tokens = 3554, batch.n_tokens = 4
slot update_slots: id 2 | task 205 | created context checkpoint 3 of 32 (pos_min = 0, pos_max = 3549, n_tokens = 3550, size = 2773.479 MiB)
slot print_timing: id 2 | task 205 |
prompt eval time = 5691.32 ms / 1769 tokens ( 3.22 ms per token, 310.82 tokens per second)
eval time = 2791.91 ms / 60 tokens ( 46.53 ms per token, 21.49 tokens per second)
total time = 8483.23 ms / 1829 tokens
slot release: id 2 | task 205 | stop processing: n_tokens = 3613, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
Segmentation fault (core dumped)
When sending a large prompt as the very first (cold) request, the crash happens immediately after warmup:
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
srv update_slots: all slots are idle
Segmentation fault (core dumped)
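When comparing runs, it can be handy to pull the timing figures out of the server log and recompute the per-token rates. The following is a minimal sketch; the regular expression is written against the `print_timing` lines quoted above and is an assumption about their format, not an official log schema. `parse_timings` is a hypothetical helper.

```python
import re

# Matches lines such as:
#   prompt eval time = 5691.32 ms / 1769 tokens ( 3.22 ms per token, ...)
TIMING_RE = re.compile(r'^\s*(prompt eval|eval|total) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens')


def parse_timings(log_text: str) -> dict:
    """Extract milliseconds and token counts, then derive per-token rates."""
    timings = {}
    for line in log_text.splitlines():
        match = TIMING_RE.match(line)
        if not match:
            continue
        kind = match.group(1)
        ms = float(match.group(2))
        tokens = int(match.group(3))
        timings[kind] = {
            'ms': ms,
            'tokens': tokens,
            'ms_per_token': ms / tokens,
            'tokens_per_second': tokens / (ms / 1000.0),
        }
    return timings


SAMPLE_LOG = """\
slot print_timing: id 2 | task 205 |
prompt eval time = 5691.32 ms / 1769 tokens ( 3.22 ms per token, 310.82 tokens per second)
eval time = 2791.91 ms / 60 tokens ( 46.53 ms per token, 21.49 tokens per second)
total time = 8483.23 ms / 1829 tokens
"""

if __name__ == '__main__':
    for kind, values in parse_timings(SAMPLE_LOG).items():
        print(kind, round(values['ms_per_token'], 2), round(values['tokens_per_second'], 2))
```

Note that this request evaluated only 1,769 prompt tokens while the slot held 3,554, so the prefill rate here is not directly comparable to a cold request of the same length.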