Eval bug: Segfault with Gemma 4 31B at ~5,500+ prompt tokens (CUDA, multi-GPU) #21401

@mathiassamuelson

Name and Version

ggml_cuda_init: found 4 CUDA devices (Total VRAM: 96496 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
version: 8660 (d006858)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

  • AMD Ryzen 5 9600X 6-Core Processor, 64GB system RAM
  • 2x NVIDIA GeForce RTX 3090 (24GB each), connected via NVLink (NV4)

Models

ggml-org/gemma-4-31B-it-GGUF (Q8_0, 32.6 GB)

Problem description & steps to reproduce

llama-server segfaults when processing prompts longer than approximately 5,500 tokens with the Gemma 4 31B Q8_0 model on a dual-GPU CUDA setup with layer splitting.

Crash boundary: Prompts up to ~5,400 tokens work reliably. Prompts of ~5,600+ tokens cause a segfault. The crash occurs after prompt processing completes: the server logs a successful 200 response, then segfaults.

Minimal reproduction:
`CUDA_VISIBLE_DEVICES=0,2 llama-server \
  -m gemma-4-31B-it-Q8_0.gguf \
  -ngl 999 \
  --host 0.0.0.0 \
  --port 8080 \
  --cache-ram 0 \
  -c 32768

python3 -c "
import requests
filler = 'The quick brown fox jumps over the lazy dog near the river bank. ' * 400
payload = {
    'model': 'gemma-4-31B-it',
    'messages': [{'role': 'user', 'content': f'Summarize:\n\n{filler}\n\nOne paragraph.'}],
    'max_tokens': 50,
    'temperature': 1.0,
    'top_p': 0.95
}
r = requests.post('http://localhost:8080/v1/chat/completions', json=payload, timeout=300)
print(r.json())
"`
What works (390 repeats, ~5,482 prompt tokens): Completes normally, returns 200.
What crashes (400 repeats, ~5,600+ prompt tokens): Server logs the 200 response, then segfaults.
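The boundary between 390 and 400 repeats could be narrowed with a bisection. The following is a minimal sketch, not part of the original reproduction: `find_crash_boundary` and `server_survives` are hypothetical helper names, the 2-second wait is an arbitrary guess, and the search assumes the failure is monotonic in prompt length (every count above the boundary crashes). The probe uses only the standard library and checks the server's `/health` endpoint after the completion returns, since the segfault is reported to follow the 200 response.

```python
import json
import time
import urllib.request
from typing import Callable

# Same filler sentence as the reproduction script above.
FILLER = 'The quick brown fox jumps over the lazy dog near the river bank. '


def find_crash_boundary(survives: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the smallest repeat count in (lo, hi] for which survives() is False.

    Assumes survives(lo) is True, survives(hi) is False, and that the
    failure is monotonic in the repeat count.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if survives(mid):
            lo = mid   # mid still works: boundary is above mid
        else:
            hi = mid   # mid crashes: boundary is at or below mid
    return hi


def server_survives(repeats: int, base_url: str = 'http://localhost:8080') -> bool:
    """Hypothetical probe: send one request, then check the server is still up.

    The server must be freshly (re)started before each call, because a
    crashed server makes every later probe fail as well.
    """
    payload = {
        'model': 'gemma-4-31B-it',
        'messages': [{'role': 'user',
                      'content': f'Summarize:\n\n{FILLER * repeats}\n\nOne paragraph.'}],
        'max_tokens': 50,
    }
    request = urllib.request.Request(
        base_url + '/v1/chat/completions',
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
    )
    try:
        urllib.request.urlopen(request, timeout=300).read()
        time.sleep(2)  # assumed delay; give the process time to crash after the 200
        urllib.request.urlopen(base_url + '/health', timeout=5).read()
        return True
    except OSError:  # covers URLError, connection resets, and timeouts
        return False
```

Calling `find_crash_boundary(server_survives, 390, 400)` needs at most four probes for this range. Because a crashed server has to be restarted by hand, in practice the callable passed in would need to pause for a restart before each probe.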
Workarounds attempted (none helped):

  • --cache-ram 0 (disable prompt cache)
  • -b 512 -ub 256 (reduce batch sizes)
  • --flash-attn off (disable flash attention)
  • -np 1 (single concurrent slot)

Server Startup Configuration

`ggml_cuda_init: found 2 CUDA devices (Total VRAM: 48248 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB

load_tensors: CPU_Mapped model buffer size = 1428.00 MiB
load_tensors: CUDA0 model buffer size = 15325.78 MiB
load_tensors: CUDA1 model buffer size = 15783.04 MiB
load_tensors: offloaded 61/61 layers to GPU

llama_context: pipeline parallelism enabled
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve: graph splits = 3`

Additional Context

  • Short prompts (<5,400 tokens) work perfectly, including multi-turn conversations
  • Decode speed is consistently ~23-24 tok/s
  • Prefill speed is ~800-900 tok/s for working prompt sizes
  • The segfault boundary appears to depend on prompt token count, not accumulated state: a single cold request at 5,600 tokens crashes immediately
  • Possibly related to the hybrid SWA/global attention KV cache management at certain sequence lengths
  • The crash reproduces identically on single-GPU with partial offload, ruling out pipeline parallelism as the cause

First Bad Commit

No response

Relevant log output

Last lines of the server log before the crash (a successful request, followed by a crash on the next request at the boundary):

slot update_slots: id 2 | task 205 | prompt processing done, n_tokens = 3554, batch.n_tokens = 4
slot update_slots: id 2 | task 205 | created context checkpoint 3 of 32 (pos_min = 0, pos_max = 3549, n_tokens = 3550, size = 2773.479 MiB)
slot print_timing: id 2 | task 205 |
prompt eval time = 5691.32 ms / 1769 tokens ( 3.22 ms per token, 310.82 tokens per second)
eval time = 2791.91 ms / 60 tokens ( 46.53 ms per token, 21.49 tokens per second)
total time = 8483.23 ms / 1829 tokens
slot release: id 2 | task 205 | stop processing: n_tokens = 3613, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
Segmentation fault (core dumped)
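For comparing throughput between runs, the `print_timing` lines above can be parsed mechanically. This is a small sketch with a hypothetical helper, `parse_timings`; the regular expression is written against the three timing lines shown in this log only and may need adjusting for other llama-server versions.

```python
import re

# Matches lines such as:
#   prompt eval time = 5691.32 ms / 1769 tokens ( ... )
TIMING_RE = re.compile(
    r'(?P<label>prompt eval|eval|total) time\s*=\s*'
    r'(?P<ms>\d+(?:\.\d+)?) ms\s*/\s*(?P<tokens>\d+) tokens'
)


def parse_timings(log_text: str) -> dict:
    """Return {label: {'ms', 'tokens', 'tok_per_s'}} for each timing line found."""
    timings = {}
    for match in TIMING_RE.finditer(log_text):
        ms = float(match['ms'])
        tokens = int(match['tokens'])
        timings[match['label']] = {
            'ms': ms,
            'tokens': tokens,
            # Recompute throughput from the raw numbers rather than
            # trusting the rounded value printed in the log.
            'tok_per_s': tokens / (ms / 1000.0),
        }
    return timings


sample = """
prompt eval time = 5691.32 ms / 1769 tokens ( 3.22 ms per token, 310.82 tokens per second)
eval time = 2791.91 ms / 60 tokens ( 46.53 ms per token, 21.49 tokens per second)
total time = 8483.23 ms / 1829 tokens
"""
for label, stats in parse_timings(sample).items():
    print(f"{label}: {stats['tokens']} tokens, {stats['tok_per_s']:.2f} tok/s")
```

The recomputed values agree with the figures printed in the log to within rounding, which makes the helper usable for checking prefill and decode rates across prompt sizes.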

When sending a large prompt as the very first (cold) request, the crash happens immediately after warmup:

main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
srv update_slots: all slots are idle
Segmentation fault (core dumped)
